PREFACE

The RAND Corporation has been assisting the Air Force Systems
Command in developing and teaching a course in military systems cost
analysis concepts and techniques given at the Air Force Institute of
Technology. This Memorandum was prepared for that course and is being
published as one of a series of memoranda which will serve as course
material for future classes. The paper should also be of interest to
others within the Air Force who are concerned with the problem of
derivation of estimating relationships.

Basic data used in the statistical analyses presented in this
Memorandum were taken from actual historical data sources. However,
the data have been transformed and adjusted to eliminate security and
proprietary information classifications and to better serve the
requirements for use as instructional material. For this reason the
basic data should be regarded as essentially hypothetical, and the
estimating relationships derived therefrom must be viewed as
illustrative only. They should not be used in actually making estimates
of airframe initial tooling cost.


SUMMARY 


This Memorandum presents illustrative examples of how statistical
regression analysis may be used to derive estimating relationships
from historical data. The specific illustration pertains to estimating
relationships for airframe initial tooling cost as a function of
aircraft performance and physical characteristics.

Examples of simple linear regression, logarithmic linear regression,
second degree regression, and multiple linear regression analyses
are presented and discussed.

Student problems are contained in Appendices B, C, and D.
I.  INTRODUCTION

The purpose of this Memorandum is to present a step-by-step
illustration of how statistical regression analysis may be used in the
derivation of estimating relationships from historical data. It is to
be used as part of the instructional material for the USAFIT cost
analysis course and possibly for cost analysis training that might be given
at RAND.

In this case the specific example refers to an estimating relationship
for airframe initial (non-recurring) tooling cost for manned
aircraft. The objective is to show how initial tooling cost may be related
to aircraft characteristics. Information and data used as a basis for
the statistical analyses were taken from actual historical data sources.
However, the data have been transformed and adjusted to eliminate
security and proprietary information classifications and to better serve
the requirements for use as instructional material. For this reason
the basic data should be regarded as essentially hypothetical, and the
estimating relationships derived therefrom must be viewed as
illustrative only. They should not be used in actually making estimates of
airframe initial tooling cost. The objective here is solely to
demonstrate manipulation of data and analytical techniques.

Even though the information underlying the analysis must be considered
hypothetical, the example does contain and illustrate many of
the major problems encountered in a real life situation: for example,
a very small sample size, uneven distribution of observations in the
sample over the ranges of the explanatory variables, the difficulty of
ascertaining reasonably good explanatory variables, non-homogeneities
in the data, and the like. All of these factors, and others, compound
the problems involved in derivation of statistical estimating
relationships for use in military cost analysis activities. While this is also
true in other fields, the difficulties seem particularly severe in the
cost analysis of advanced weapon systems and forces. But we must do
the best we can with a small and very often heterogeneous data base.

It should also be remembered that statistical estimating relationships
derived in such an adverse environment must always be used with
caution, particularly when extrapolating to distant future weapon
systems. For the most part, use of an estimating relationship should
be viewed as a point of departure, to be modified by experience,
judgment, and external or supplementary information. This in no way
downgrades the need for developing and keeping current a good library of
estimating relationships. Without such a library, the cost analyst
does not have even a point of departure. Also, a reasonably complete
stock of estimating relationships is a prime prerequisite to being
able to do sensitivity analysis studies.


II.  STATEMENT OF THE PROBLEM


Suppose that we have collected historical data on airframe initial
tooling cost for 14 types of aircraft: 7 fighters (F-1, F-2, ..., F-7)
and 7 bombers (B-1, B-2, ..., B-7). (To eliminate the effect of price
level changes, these cost data were adjusted statistically and expressed
in terms of 1962 dollars.) In addition, certain aircraft characteristics
data have been assembled for each of the 14 cases: AMPR airframe weight,
maximum speed, and combat radius. All of this information is summarized
in Table 1 on the next page.

Given the data contained in Table 1, the problem is to try to
derive an estimating relationship for initial tooling cost ($X_1$)
expressed as a function of one or more of the "explanatory" variables
$X_2$, $X_3$, $X_4$. We are immediately confronted with questions like the
following: What explanatory variable or variables shall we include in
the estimating relationships? What functional form seems appropriate?
Shall we stratify the sample, e.g., treat bombers and fighters
separately?

Several techniques are available to help answer these questions.
Probably the simplest way to proceed is merely to plot the basic data
on scatter diagrams, with a separate plot for each of the explanatory
variables ($X_2$, $X_3$, $X_4$) in relation to the dependent variable ($X_1$). This
has been done in Figs. 1 - 3.

A brief examination of Figs. 1 - 3 suggests that:

(1) Of the three explanatory variables, airframe weight ($X_2$)
seems to best "explain" variations in initial tooling
cost ($X_1$), combat radius ($X_4$) is second best, and
maximum speed ($X_3$) is the least satisfactory.
Table 1

INITIAL TOOLING COST AND VARIOUS AIRCRAFT CHARACTERISTIC DATA

            Initial              AMPR
            Tooling Cost         Airframe Weight      Maximum       Combat
Aircraft    (Millions 1962 $)    (Thousands of Lb)    Speed (Kn)    Radius (N Mi)
Type        ($X_1$)              ($X_2$)              ($X_3$)       ($X_4$)

F-1             8                    7                   525            275
F-2            15                    8                   575            300
F-3            20                    9                   600            250
F-4            40                   15                   750            300
F-5            30                   12                   800            600
F-6            35                   20                  1100            360
F-7            70                   25                  1200            550
B-1            50                   40                   525           1800
B-2           265                  115                   550           3000
B-3           110                   50                  1100           2200
B-4            85                   70                   525           1000
B-5            60                   50                   330           1000
B-6            20                   20                   500            800
B-7           165                   90                   550            780

SOURCE: Hypothetical data

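[Figs. 1 - 3: scatter diagrams of initial tooling cost against airframe
weight, maximum speed, and combat radius, with bombers and fighters
marked by separate symbols]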
(2) In the case of $X_1$ vs. $X_2$ and $X_1$ vs. $X_4$, a linear
functional form seems about as appropriate as anything else.

(3) Except possibly for Fig. 2 ($X_1$ vs. $X_3$), there does not
seem to be a compelling reason for stratification,
e.g., treating the bombers and fighters separately.
In Figs. 1 and 3, bombers and fighters tend to be in
the same general path of an estimating relationship
that would be fitted to the data. Also in the specific
case at hand, the total sample (14 observations) is
already small, and to divide the sample into bombers
and fighters would reduce the data to two sub-sets of
only 7 observations each. We shall therefore treat
bombers and fighters together in the statistical analyses
which follow.

(4) In general, the scatter diagrams do not indicate as
close a relationship between $X_1$ and the explanatory
variables as we would like. But this is rather typical,
and we have to proceed to develop the best estimating
relationships we can, given the available data. Also,
in using relationships so derived, care must be
exercised to take into account the limitations of the data
base.
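
The visual ranking in item (1) can be cross-checked numerically. The sketch below is an illustration added for the reader, not part of the original analysis: it computes the simple correlation coefficient between tooling cost and each explanatory variable from the Table 1 data. The helper name `pearson_r` and the dictionary layout are assumptions made for this example.

```python
from math import sqrt

# Table 1 data (hypothetical, per the memorandum), in aircraft order F-1..F-7, B-1..B-7.
cost = [8, 15, 20, 40, 30, 35, 70, 50, 265, 110, 85, 60, 20, 165]                      # X1
weight = [7, 8, 9, 15, 12, 20, 25, 40, 115, 50, 70, 50, 20, 90]                        # X2
speed = [525, 575, 600, 750, 800, 1100, 1200, 525, 550, 1100, 525, 330, 500, 550]      # X3
radius = [275, 300, 250, 300, 600, 360, 550, 1800, 3000, 2200, 1000, 1000, 800, 780]   # X4


def pearson_r(x, y):
    """Simple (unadjusted) correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_by_variable = {
    "weight": pearson_r(weight, cost),
    "speed": pearson_r(speed, cost),
    "radius": pearson_r(radius, cost),
}

for name, r in r_by_variable.items():
    print(f"cost vs. {name:<6}  r = {r:+.3f}")
```

With only 14 observations these coefficients are rough indicators at best, and they do not replace inspection of the scatter diagrams for functional form or stratification.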
III.  LINEAR NORMAL REGRESSION ANALYSIS OF INITIAL TOOLING COST
AS A FUNCTION OF AIRFRAME WEIGHT

As an illustrative example, we shall now proceed to derive an
estimating relationship for initial tooling cost as a function of
airframe weight, using the data contained in the first two columns of
Table 1. The specific statistical technique used will be a linear
normal regression model.*

The first step is to take the basic data for $X_1$ and $X_2$ from Table
1 and compute the cross products and squares, the sums of these items,
and the sample means for $X_1$ and $X_2$. The results of these calculations
are shown in Table 2 on the next page.

These data are now used to compute estimates of the parameters
$\alpha$ and $\beta$ in the linear regression (estimating) equation:

(1)  $X_1 = \alpha + \beta X_2$

In a linear normal regression model this amounts to finding the values
of $\alpha$ and $\beta$ such that the sum of the squares of the deviations of the
sample observations from the regression line will be at a minimum; i.e.,

(2)  $\sum [X_1 - (\alpha + \beta X_2)]^2 = \text{minimum}$

The minimization of (2) with respect to $\alpha$ and $\beta$ is a straightforward
calculus problem. The results of such a minimization yield the


* For a detailed discussion of linear normal regression models,
see A. M. Mood, Introduction to the Theory of Statistics, McGraw-Hill
Book Company, Inc., 1950, pp. 291-299; F. E. Croxton and D. J. Cowden,
Applied General Statistics, Prentice-Hall, Inc., 1940, Chap. XXII; and
G. W. Snedecor, Statistical Methods, Iowa State College Press, Fourth
Edition, pp. 103-137.




Table 2

DATA FOR REGRESSION ANALYSIS OF INITIAL TOOLING
COST AND AIRFRAME WEIGHT

            $X_1$        $X_2$
Aircraft    (Tooling     (Airframe
Type        Cost)        Weight)      $X_1 X_2$    $X_1^2$     $X_2^2$

F-1            8            7             56           64          49
F-2           15            8            120          225          64
F-3           20            9            180          400          81
F-4           40           15            600        1,600         225
F-5           30           12            360          900         144
F-6           35           20            700        1,225         400
F-7           70           25          1,750        4,900         625
B-1           50           40          2,000        2,500       1,600
B-2          265          115         30,475       70,225      13,225
B-3          110           50          5,500       12,100       2,500
B-4           85           70          5,950        7,225       4,900
B-5           60           50          3,000        3,600       2,500
B-6           20           20            400          400         400
B-7          165           90         14,850       27,225       8,100

Totals       973          531         65,941      132,589      34,813

SOURCE: Table 1.

Sample size = N = 14

Mean of the $X_1$'s = $\bar{X}_1 = \sum X_1 / N = 973/14 = 69.50$

Mean of the $X_2$'s = $\bar{X}_2 = \sum X_2 / N = 531/14 = 37.93$
so-called "normal equations" for linear normal regression:

(3)  $\sum X_1 = N\alpha + \beta \sum X_2$

(4)  $\sum X_1 X_2 = \alpha \sum X_2 + \beta \sum X_2^2$

The relevant numerical values to be substituted in (3) and (4)
are contained in Table 2. They are:

$N = 14$

$\sum X_1 = 973$

$\sum X_2 = 531$

$\sum X_1 X_2 = 65,941$

$\sum X_2^2 = 34,813$

Substituting these numbers in the normal equations (3) and (4), we
obtain:

(5)  $973 = 14\alpha + 531\beta$

(6)  $65,941 = 531\alpha + 34,813\beta$

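Equations (5) and (6) must be solved simultaneously to obtain
estimates of the regression coefficients ($\alpha$ and $\beta$). To do this we must
first divide the coefficient of $\alpha$ in equation (6) by the coefficient
of $\alpha$ in equation (5):

$\dfrac{531}{14} = 37.928571.$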

Then multiply equation (5) by 37.928571, obtaining a new equation (5'):

$(37.928571)(973) = (37.928571)(14\alpha + 531\beta),$

or,

(5')  $36,904.4996 = 531\alpha + 20,140.0712\beta.$*

Finally, by subtracting equation (5') from equation (6) we can
eliminate $\alpha$ and solve the resulting equation for $\beta$:

(6)     $65,941.0000 = 531\alpha + 34,813.0000\beta$
(5')   $-36,904.4996 = -531\alpha - 20,140.0712\beta$

        $29,036.5004 = 14,672.9288\beta$

        $\beta = 1.978916.$


Having an estimate of the regression coefficient $\beta$, the estimate of $\alpha$
may be calculated by substituting $\beta = 1.978916$ in equation (6) and
solving the resulting equation for $\alpha$:

$65,941 = 531\alpha + (34,813)(1.978916)$

$65,941 = 531\alpha + 68,892.0027$

$531\alpha = -2,951.0027$

$\alpha = -5.557444.$


The results of these calculations may be checked by substituting
$\alpha = -5.557444$ and $\beta = 1.978916$ in equation (5):

$973 = (14)(-5.557444) + (531)(1.978916)$

$973 = -77.804216 + 1,050.804396$

$973 \approx 973.000180$

* Notice that the value of equation (5) is unchanged, since we
multiplied both sides of the equation by the same number (37.928571).

Thus, the calculated regression coefficients are:

$\alpha = -5.557444$
$\beta = 1.978916,$

and the estimating equation is:

(7)  $X_1 = -5.5574 + 1.9789 X_2$,

where

$X_1$ = initial tooling cost in millions of 1962 dollars,

$X_2$ = AMPR airframe weight in thousands of pounds.
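
The hand calculation above can be repeated mechanically. The following is a minimal sketch, added for illustration, that rebuilds the Table 2 sums from the Table 1 data and solves the two normal equations (3) and (4) by Cramer's rule rather than by the elimination steps used in the text. All variable names are assumptions of this example.

```python
# Table 1 data (hypothetical, per the memorandum).
cost = [8, 15, 20, 40, 30, 35, 70, 50, 265, 110, 85, 60, 20, 165]    # X1
weight = [7, 8, 9, 15, 12, 20, 25, 40, 115, 50, 70, 50, 20, 90]      # X2

n = len(cost)
sum_x1 = sum(cost)
sum_x2 = sum(weight)
sum_x1x2 = sum(x1 * x2 for x1, x2 in zip(cost, weight))
sum_x2sq = sum(x2 * x2 for x2 in weight)

# Normal equations:
#   sum_x1   = n * alpha      + beta * sum_x2
#   sum_x1x2 = alpha * sum_x2 + beta * sum_x2sq
# Solve the 2 x 2 system by Cramer's rule.
det = n * sum_x2sq - sum_x2 ** 2
alpha = (sum_x1 * sum_x2sq - sum_x2 * sum_x1x2) / det
beta = (n * sum_x1x2 - sum_x2 * sum_x1) / det

print(f"sums: {sum_x1}, {sum_x2}, {sum_x1x2}, {sum_x2sq}")
print(f"alpha = {alpha:.6f}, beta = {beta:.6f}")

# Two plot points for drawing the regression line.
for x2 in (10, 100):
    print(f"X2 = {x2:>3}: X1 = {alpha + beta * x2:.1f}")
```

Because the text rounds intermediate results to a fixed number of decimals, the last digits of $\alpha$ obtained this way may differ slightly from the hand-computed value.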

Equation (7) may be plotted on the scatter diagram contained in
Fig. 1. Two plot points are needed for this purpose. Computing the
value of $X_1$ for $X_2 = 10$ and $X_2 = 100$ from equation (7), we obtain:

$X_1 = 14.2$ (for $X_2 = 10$)

$X_1 = 192.3$ (for $X_2 = 100$)


The results of plotting these numbers and drawing in the regression
line are shown in Fig. 4 (the solid line).*

The regression line is in effect an average relationship. Specifically,
in this instance it is that line about which the sum of the
squares of the deviations of $X_1$ is at a minimum. Usually, however, we
are not only interested in averages, but also in the reliability of
these averages. In the case of regression analysis, one measure of
reliability is the standard error of estimate ($S$) of the regression
equation. The standard error of estimate is defined as the square
root of the unexplained variance of the $X_1$'s in the sample. This
unexplained variance is obtained by computing the difference between
the total variance of the $X_1$'s and the "explained" variance (the
variance accounted for by the regression line).** The shortcut method
for determining the unexplained variance is:***

(8)  $S^2 = \dfrac{\sum X_1^2 - (\alpha \sum X_1 + \beta \sum X_1 X_2)}{N}$

The unadjusted standard error of estimate ($S$) is the square root of
expression (8). The adjusted standard error of estimate ($\bar{S}$) is
obtained by subtracting the number of parameters in the regression

* The fact that the "least squares" fit to the sample data produces
a slightly negative value for $\alpha$ may seem disturbing. This need not be
the case, however, since we would not want to use the estimating
equation for extremely low values of $X_2$, certainly not for $X_2 < 5000$ lb.
Also, the standard error of $\alpha$ is such that in a statistical inference
sense we cannot be confident that the universe value of $\alpha$ is less than
zero.

** The concepts of total, explained, and unexplained variance are
discussed in more detail later.

*** See F. E. Croxton and D. J. Cowden, Applied General Statistics,
Prentice-Hall, Inc., 1940, pp. 661-63 and 671.


equation from the sample size (N) in the formula for $S$. In the case
of simple normal linear regression, the number of parameters in the
regression equation is 2. Therefore, the formula for $\bar{S}$ is:

(9)  $\bar{S} = \sqrt{\dfrac{\sum X_1^2 - (\alpha \sum X_1 + \beta \sum X_1 X_2)}{N - 2}}$

From (9) it is clear that for large sample sizes (large N) the
adjustment is of no importance. However, in small samples, particularly
very small samples, the adjustment can make quite a difference. In
general $S < \bar{S}$, and $\bar{S}$ approaches $S$ as N becomes large.

Regarding interpretation of the standard error of estimate, the
main point is that in normal linear regression analyses one might
expect that about two-thirds of the sample observations would fall
within a region bounded by ± 1 standard error of estimate from the
regression line; about 95 per cent of the observations within ± 2
standard errors of estimate from the regression line; and virtually
all of the observations within ± 3 standard errors of estimate. In
practice these generalizations do not tend to hold up very well in
very small sample cases. It should also be emphasized that here we
are talking about distribution of the observations in the sample, and
not about the reliability or "confidence limits" pertaining to a
predicted $X_1$ as given by the estimating equation. (The subject of
prediction intervals will be taken up later.)

Returning now to our illustrative example, the standard error of
estimate is computed by substituting the appropriate data in equation
(9). We have already calculated $\alpha = -5.557444$ and $\beta = 1.978916$; the
sample size is 14; and the required summations are contained in
Table 2. We have, therefore:

$\bar{S} = \sqrt{\dfrac{132,589 - (-5.557444)(973) - (1.978916)(65,941)}{12}}$

$= \sqrt{\dfrac{132,589 + 5,407.39 - 130,491.70}{12}}$

$= 25.01$ (millions of 1962 dollars).*

In Fig. 4, a band of ± 1 $\bar{S}$ from the regression line has been plotted
on the scatter diagram (the dotted lines).
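
As a cross-check on the shortcut formulas, the standard errors can also be computed directly from the residuals. The sketch below is an illustration only, with variable names chosen for this example. It fits the line, forms the unadjusted and adjusted standard errors of estimate, and counts how many of the 14 sample points fall inside bands of one and two adjusted standard errors around the regression line.

```python
from math import sqrt

# Table 1 data (hypothetical, per the memorandum).
cost = [8, 15, 20, 40, 30, 35, 70, 50, 265, 110, 85, 60, 20, 165]    # X1
weight = [7, 8, 9, 15, 12, 20, 25, 40, 115, 50, 70, 50, 20, 90]      # X2

n = len(cost)
mean_x1 = sum(cost) / n
mean_x2 = sum(weight) / n

# Least-squares slope and intercept in deviation form.
sxy = sum((x2 - mean_x2) * (x1 - mean_x1) for x2, x1 in zip(weight, cost))
sxx = sum((x2 - mean_x2) ** 2 for x2 in weight)
beta = sxy / sxx
alpha = mean_x1 - beta * mean_x2

# Deviations of the observations from the regression line.
residuals = [x1 - (alpha + beta * x2) for x2, x1 in zip(weight, cost)]
residual_ss = sum(e * e for e in residuals)

s_unadjusted = sqrt(residual_ss / n)        # square root of expression (8)
s_adjusted = sqrt(residual_ss / (n - 2))    # equation (9), two parameters fitted

within_1 = sum(abs(e) <= s_adjusted for e in residuals)
within_2 = sum(abs(e) <= 2 * s_adjusted for e in residuals)

print(f"unadjusted S = {s_unadjusted:.2f}, adjusted S = {s_adjusted:.2f}")
print(f"within 1 S: {within_1} of {n}; within 2 S: {within_2} of {n}")
```

The two counts can be set against the "two-thirds" and "95 per cent" rules of thumb, bearing in mind the caveat in the text that such rules hold only loosely in very small samples.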

For some purposes, particularly in comparing one $\bar{S}$ with another,
it is useful to compute a relative standard error of estimate. One
such measure is the coefficient of variation (C), which relates the
standard error of estimate to the mean of the sample $X_1$'s:

(10)  $C = \dfrac{\bar{S}}{\bar{X}_1}$.

In the case of our illustrative example, the mean of the $X_1$'s (from
Table 2) is $\bar{X}_1 = 69.5$. The value of C, therefore, is:

$C = \dfrac{25.01}{69.50} = 36\%$.


* As a matter of interest, the unadjusted standard error of estimate
is 23.16.

The question of reliability of an estimating equation is, of course,
a relative matter, i.e., relative to the context in which the equation
is to be used. However, as a general rule, we would usually
prefer not to have a C as high as 36 per cent. Something like 10 to
20 per cent would be more desirable.

So far the question of reliability has been considered in the
context of the regression equation in relation to the sample
observations. But this is usually not the context that is of greatest interest.
Rather than being concerned with how well the regression equation
describes the sample observations per se, the analyst is most usually
interested in using the estimating equation to predict values of $X_1$
in the "population" or "universe" that the sample supposedly
represents. In the context of prediction, the standard error of estimate
does not furnish a good measure of uncertainty or reliability of the
estimating (regression) equation. In a formal sense, what we would
like is somewhat as follows. For a given value of the explanatory
variable, say $\hat{X}_2$, the estimating equation is used to obtain a
predicted value of the dependent variable:

$\hat{X}_1 = \alpha + \beta \hat{X}_2$.

Then we would like to put a boundary around $\hat{X}_1$, say $\hat{X}_1 \pm A$, such
that there is a certain level of confidence that the established
interval does indeed "bracket" the "true" value of $X_1$ in the
population. The subject of "prediction intervals" is addressed to this
problem.

In the case of normal linear regression it has been established
that a $100(1 - \epsilon)$ per cent prediction interval for an estimated value
of the dependent variable, say $\hat{X}_1$, can be constructed as follows:*

$\hat{X}_1 \pm A$,

where

(11)  $A = S \, t_{\epsilon} \sqrt{\dfrac{N}{N - 2}} \sqrt{\dfrac{N + 1}{N} + \dfrac{(\hat{X}_2 - \bar{X}_2)^2}{\sum (X_2 - \bar{X}_2)^2}}$

The meaning of the notation in (11) is:

$S$ = standard error of the estimating equation from which
$\hat{X}_1$ was obtained,

$t_{\epsilon}$ = the value of t obtained from a table of Student's "t"
distribution for the $\epsilon$ significance level,

$N$ = size of the sample used to derive the estimating equation,

$\hat{X}_2$ = the specified value of the explanatory variable used
as a basis for obtaining $\hat{X}_1$,

$\bar{X}_2$ = the mean of the $X_2$'s in the sample used to derive the
estimating equation,

$\sum (X_2 - \bar{X}_2)^2$ = the sum of squared deviations of the sample $X_2$'s from
their mean.


We shall now apply this procedure to our illustrative example,
using the estimating equation (7) derived previously:

(7)  $X_1 = -5.5574 + 1.9789 X_2$,

and assuming that we want to estimate the value of $X_1$ for $X_2 = 90,000$
lb. From (7) we have:

$\hat{X}_1 = -5.5574 + 1.9789(90)$,

$= -5.5574 + 178.1010$,

$= 172.5$ (millions of 1962 dollars).

* For a derivation and explanation of this procedure, see A. M.
Mood, op. cit., pp. 297-99.


Now let us assume that we want to establish a 95 per cent prediction
interval around $\hat{X}_1 = 172.5$, using equation (11) to derive the
value of A. The necessary data are as follows:

$S$ = 25.01. This is the value of the standard error of
estimate calculated previously for equation (7).

$\epsilon$ = 0.05. Since by assumption a 0.95 prediction interval
is to be computed, then $1 - \epsilon = 0.95$, or $\epsilon = 0.05$.

$t_{0.05}$ = 2.179. This number is obtained from a table of values
for the "t" distribution contained in G. W.
Snedecor, Statistical Methods, op. cit., p. 65.
The number 2.179 is found in the 0.05 column
(corresponding to $\epsilon = 0.05$) on the 12 "degrees
of freedom" row. In a regression analysis the
term "degrees of freedom" means the sample size
(N) minus the number of parameters in the
regression equation. In our case N = 14 and there
are 2 parameters ($\alpha$ and $\beta$) in the regression
equation. Therefore degrees of freedom = 14 - 2 = 12.

$N$ = 14. The sample size used as a basis for developing the
estimating equation is 14.

$\hat{X}_2$ = 90. By assumption.

$\bar{X}_2$ = 37.9. This is the value of the mean of the $X_2$'s
contained in Table 2.

$\sum (X_2 - \bar{X}_2)^2$ = 14,673. This quantity is best computed by using the
following shortcut method:*

$\sum (X_2 - \bar{X}_2)^2 = \sum X_2^2 - (\sum X_2)^2 / N$

$= 34,813 - 281,961/14$

$= 34,813 - 20,140$

$= 14,673.$

* The required numerical data are contained in Table 2.

Substituting the above data in equation (11), we have:

$A = (25.01)(2.179) \sqrt{\dfrac{14}{12}} \sqrt{\dfrac{15}{14} + \dfrac{(90 - 37.9)^2}{14,673}}$

$= (54.50) \sqrt{1.25 + (1.17)(0.184)}$

$= (54.50) \sqrt{1.465} = (54.50)(1.21)$

$= 65.95.$

Therefore, for $\hat{X}_2 = 90$ the 95% prediction interval is:

$\hat{X}_1 \pm A$, or 172.5 ± 66.0 = 106.5 and 238.5.

This means that we have a subjective confidence of 95% that the interval
106.5 to 238.5 brackets the "true" or "population" value of $X_1$
corresponding to $X_2 = 90$. It should be emphasized that a 95% prediction
interval does not mean that the probability is 0.95 that the "true"
value of $X_1$ lies within the interval. Rather it means that we are
95% "confident" (in a subjective sense) that this is the case.
Statisticians call this fiducial probability, as opposed to a true
probability statement.*

Using the prediction interval procedure outlined above, we can
compute 95% prediction intervals for $\hat{X}_1$ for numerous specified values
of $\hat{X}_2$. The following are illustrative cases:

$\hat{X}_2$      $\hat{X}_1 \pm A$

  10       14.2 ± 62.4  =  -48.2 and  76.6
  38       69.6 ± 61.0  =    8.6 and 130.6
  60      113.1 ± 61.9  =   51.2 and 175.0
  90      172.5 ± 66.0  =  106.5 and 238.5
 120      231.9 ± 73.0  =  158.9 and 304.9
 150      291.2 ± 81.8  =  209.4 and 373.0

Plotting these numbers on a scatter diagram and connecting the points,
we obtain a 95% confidence band around the regression line. (See the
heavy dashed lines in Fig. 5 on the next page.) In this case it is
clear from the figure that the 95% confidence region is fairly wide,
reflecting graphically a measure of the uncertainty associated with
the estimating equation. This is rather typical of analyses based on
small samples. The equation for the prediction interval is constructed
so that the width of the interval is quite sensitive to variation in
sample size when N is small. Sensitivity to small values of N is
logical, since generalizations based on very small samples should be
subject to greater uncertainty than those based on a larger data base.

* See Mood, op. cit., pp. 221-22.
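
The tabulated intervals can be regenerated with a few lines of code. The sketch below is illustrative only and mirrors the arithmetic of the worked example as printed: it uses the value 25.01 for the standard error together with the $\sqrt{N/(N-2)}$ factor, and takes t = 2.179 from the text rather than from a statistical library. For comparison, the usual textbook prediction interval built on the adjusted standard error omits that extra factor and is therefore somewhat narrower. The function and constant names are assumptions of this example.

```python
from math import sqrt

# Airframe weights (X2) from Table 1 (hypothetical data).
weight = [7, 8, 9, 15, 12, 20, 25, 40, 115, 50, 70, 50, 20, 90]

N = len(weight)
ALPHA, BETA = -5.5574, 1.9789    # coefficients of estimating equation (7)
S = 25.01                        # standard error used in the worked example
T_95 = 2.179                     # t for 12 d.f., 0.05 column, as quoted in the text

mean_x2 = sum(weight) / N
# Shortcut form: sum of squares minus (sum squared) / N.
sum_sq_dev = sum(x * x for x in weight) - sum(weight) ** 2 / N


def half_width(x2, t=T_95):
    """Half-width A of the prediction interval, following the worked arithmetic."""
    spread = (N + 1) / N + (x2 - mean_x2) ** 2 / sum_sq_dev
    return S * t * sqrt(N / (N - 2)) * sqrt(spread)


def prediction_interval(x2, t=T_95):
    """Return (lower, predicted, upper) for a specified airframe weight x2."""
    predicted = ALPHA + BETA * x2
    a = half_width(x2, t)
    return predicted - a, predicted, predicted + a


for x2 in (10, 38, 60, 90, 120, 150):
    lower, predicted, upper = prediction_interval(x2)
    print(f"X2 = {x2:>3}: {predicted:6.1f} +/- {upper - predicted:5.1f}"
          f"  ->  {lower:6.1f} to {upper:6.1f}")
```

Small differences from the printed table in the last digit are to be expected, since the hand calculation rounds intermediate values.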
[Fig. 5: Initial tooling cost versus airframe weight]
It should also be noted that because of the $(\hat{X}_2 - \bar{X}_2)^2$ term in
equation (11) the prediction interval becomes wider as $\hat{X}_2$ is selected
farther away from the mean of $X_2$ ($\bar{X}_2 = 38$) in the sample. Thus, for
example, the prediction interval for $\hat{X}_2 = \bar{X}_2 = 38$ is:

69.6 ± 61.0 = 8.6 and 130.6;

while for $\hat{X}_2 = 200$ (which is considerably beyond the range of $X_2$ in
the sample), the 95% prediction interval is:

390.2 ± 99.7 = 290.5 and 489.9.

The width of the interval in the latter case is over 1.6 times the
width for $\hat{X}_2 = \bar{X}_2$:

99.7 / 61.0 = 1.63.

This illustrates in a rough way how our "confidence" in the estimate
decreases as we extrapolate beyond the range of the sample data,
something that we are almost always required to do in cost analysis of
advanced weapon systems and forces.


The width of the prediction interval is also sensitive to the
level of "confidence" specified. Up to now that level has been set at
95% (i.e., $\epsilon = 0.05$). Suppose that only a 70% level of confidence is
desired ($\epsilon = 0.3$). The only thing that changes in the inputs used in
the previous calculations is the value of t. Before, we used $t_{0.05}$ =
2.179; now we use $t_{0.3} = 1.083$.* This, however, makes quite a difference
in the width of the prediction interval (now a 70% interval), as can
be seen from the light dashed lines in Fig. 5. Here, since
our "confidence" is lower, the prediction interval can be narrower.

* Obtained from Table 3.8 in Snedecor, op. cit., p. 65; the 0.3
column and the d.f. = 12 row.

For lower levels of confidence, the band would be even narrower.
However, for a given level, the interval obtained by the prediction
interval procedure will always be wider than an interval established on
the basis of $\bar{S}$ alone.* This is apparent from Fig. 6 on the next page.
The heavy dashed curves denote a 95% prediction region; the light
dashed lines indicate a region established on the basis of $(\bar{S})(t_{0.05})$ =
(25.01)(2.179). Note that the two sets of boundaries are closest
together where $\hat{X}_2 = \bar{X}_2 = 38$.
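
Both comparisons in this passage can be checked numerically. The sketch below is an illustration under the same assumptions as the worked example (standard error 25.01, t values 2.179 and 1.083 as quoted, and the half-width formula as applied in the text). It compares the prediction half-width with the flat band of 25.01 times t over a grid of airframe weights, and compares the 70% and 95% half-widths at one weight. The names `ratios`, `closest`, and the grid spacing are choices made for this example.

```python
from math import sqrt

# Airframe weights (X2) from Table 1 (hypothetical data).
weight = [7, 8, 9, 15, 12, 20, 25, 40, 115, 50, 70, 50, 20, 90]

N = len(weight)
S = 25.01                     # standard error used in the worked example
T_95, T_70 = 2.179, 1.083     # t values for 12 d.f., as quoted in the text

mean_x2 = sum(weight) / N
sum_sq_dev = sum((x - mean_x2) ** 2 for x in weight)


def half_width(x2, t):
    """Half-width of the prediction interval, as applied in the worked example."""
    spread = (N + 1) / N + (x2 - mean_x2) ** 2 / sum_sq_dev
    return S * t * sqrt(N / (N - 2)) * sqrt(spread)


# Flat band based on the standard error alone (same t, no prediction adjustment).
simple_band = S * T_95

# Ratio of prediction half-width to the flat band on a grid of weights.
ratios = {x2: half_width(x2, T_95) / simple_band for x2 in range(0, 201, 10)}
closest = min(ratios, key=ratios.get)   # grid point where the two bands are nearest

# Only t changes between confidence levels, so the widths scale with t.
width_ratio_70_to_95 = half_width(90, T_70) / half_width(90, T_95)

print(f"bands are closest near X2 = {closest}")
print(f"70% width / 95% width = {width_ratio_70_to_95:.3f}")
```

Every ratio in `ratios` exceeds one, which is the numerical counterpart of the statement that the prediction region always lies outside the band based on the standard error alone.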


* But recall the point made previously: $\bar{S}$ can only be used to
measure variations of $X_1$ in the sample, not for describing the
uncertainty of a predicted $\hat{X}_1$.

Up to this point, the discussion has been confined largely to
statistical regression analyses: developing an estimating (regression)
equation and various measures of uncertainty pertaining to that
equation. From an estimating point of view, this indeed is the most
important part of the analysis.

There is, however, another form of statistical analysis called
correlation analysis. Correlation analysis is concerned with
developing an abstract measure of the degree of association between the
dependent variable and the explanatory variable (or variables). In
simple linear regression, the most commonly used measure of degree of
association is the correlation coefficient, denoted by r. The statistic
r is constructed in such a way that it is bounded by the interval
$-1 \le r \le +1$. The sign indicates whether the slope of the regression
line is positive or negative, i.e., whether the regression coefficient
$\beta$ is positive or negative. At the boundaries of the interval for r,
we have the cases of perfect correlation: r = +1 (perfect positive
correlation), r = -1 (perfect negative correlation). In these
instances all of the sample points would lie exactly on the regression
line. When there is no correlation between the variables whatsoever,
r = 0.

While correlation is a somewhat different type of analysis from that
discussed previously, it is nevertheless related in a definite way to
regression analysis. In order to see this, let us return to the
concepts of total variance, explained variance, and unexplained
variance referred to earlier in the discussion of the standard error
of estimate. Total variance pertains to the deviations of the sample
$X_1$'s from their mean, and is measured by:

    $\sigma_t^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{N}$    (N = sample size)

Explained variance refers to the deviations from $\bar{X}_1$ of the
computed $X_1$ values (calculated from the regression equation)
corresponding to the values of $X_2$ in the sample,* and is measured by:

    $\sigma_e^2 = \frac{\sum (X_{1c} - \bar{X}_1)^2}{N}$

Unexplained variance refers to the deviations of the sample $X_1$'s
from the corresponding computed values, and is measured by:

    $\sigma_u^2 = \frac{\sum (X_1 - X_{1c})^2}{N}$

* That is, for each value of $X_2$ in the sample, a corresponding
value of the dependent variable ($X_{1c}$) is computed from the
regression equation $X_{1c} = \alpha + \beta X_2$.
Intuitively one would think that the standard error of estimate might
somehow be derived from the unexplained variance. From our previous
discussion, we recall that this is indeed the case. The standard error
of estimate (unadjusted) is the square root of the unexplained
variance.*

Similarly, one would intuitively think that the correlation
coefficient (r) might somehow be derived from the explained variance.
The correlation coefficient is in fact related to the explained
variance. It is defined as the square root of the proportion of total
variance that is represented by the explained variance.** That is,

(12)  $r = \sqrt{\frac{\sum (X_{1c} - \bar{X}_1)^2 / N}{\sum (X_1 - \bar{X}_1)^2 / N}}$

We now see the interrelationship among r, S, and the regression
equation. The regression equation is used to determine the computed
$X_{1c}$'s, which are inputs to the calculation of both r and S. Also,
since $r^2$ is defined as a proportion of total variance, r and S in a
sense have an inverse relationship to one another.

* A graphic portrayal of total, explained and unexplained variance is
contained in Croxton and Cowden, op. cit., pp. 662-63.

** $r^2$ is sometimes referred to as the coefficient of determination.

Just as S had to be adjusted for sample size (particularly so in the
case of small samples), r should also be corrected. (The formula for
r, equation (12), is for the unadjusted correlation coefficient.) In
the case of simple linear correlation the value of r corrected for
sample size is as follows:*

(13)  $\bar{r}^2 = \frac{r^2 (N - 1) - 1}{N - 2}$

As is obvious from equation (13), the effect of the correction dampens
out as N becomes large. For very small samples, as in our illustrative
example, the correction should most certainly be made.

Returning to our illustrative example, we shall now compute the
correlation coefficient. The formula for r as represented by equation
(12) is rather cumbersome from a computational point of view. The
following shortcut method is preferable:**

(14)  $r^2 = \frac{(\alpha \sum X_1 + \beta \sum X_1 X_2) - \bar{X}_1 \sum X_1}{\sum X_1^2 - \bar{X}_1 \sum X_1}$

All of the data required for use in this equation have been computed
previously. We have calculated the regression coefficients to be:
$\alpha = -5.5574$, $\beta = 1.9789$. The mean of the $X_1$'s and the
summations are contained in Table 2 on page 10. Substituting these
data in equation (14), we have:

    $r^2 = \frac{(-5.5574)(973) + (1.9789)(65,941) - (69.5)(973)}{132,589 - (69.5)(973)}$

        $= \frac{-5,407.4 + 130,490.6 - 67,623.5}{132,589 - 67,623.5}$

        $= \frac{57,459.7}{64,965.5} = 0.8845$

    $r = \sqrt{0.8845} = 0.9405$  (unadjusted)

Using formula (13) to arrive at the correlation coefficient adjusted
for sample size:

    $\bar{r}^2 = \frac{r^2 (N - 1) - 1}{N - 2} = \frac{(0.8845)(14 - 1) - 1}{14 - 2}$

        $= \frac{11.4985 - 1}{12} = \frac{10.4985}{12} = 0.8749$

    $\bar{r} = \sqrt{0.8749} = 0.9354$

* See Croxton and Cowden, op. cit., p. 679.

** For discussion and derivation, see Croxton and Cowden, op. cit.,
p. 671, and Appendix B.
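The arithmetic of equations (13) and (14) can be retraced with a short
script. The following is a minimal sketch only; the function names are
our own and are not part of the course material, and the inputs are the
coefficients and summations quoted above.

```python
import math


def r_squared_shortcut(alpha, beta, n, sum_x1, sum_x1x2, sum_x1_sq):
    """Shortcut formula (14) for the unadjusted r squared."""
    mean_x1 = sum_x1 / n
    correction = mean_x1 * sum_x1
    explained = alpha * sum_x1 + beta * sum_x1x2 - correction
    total = sum_x1_sq - correction
    return explained / total


def adjust_r_squared(r_sq, n):
    """Formula (13): r squared corrected for sample size (simple linear case)."""
    return (r_sq * (n - 1) - 1) / (n - 2)


N = 14
r_sq = r_squared_shortcut(-5.5574, 1.9789, N, 973, 65_941, 132_589)
r_sq_adj = adjust_r_squared(r_sq, N)

print("unadjusted r^2 = %.4f, r = %.4f" % (r_sq, math.sqrt(r_sq)))
print("adjusted   r^2 = %.4f, r = %.4f" % (r_sq_adj, math.sqrt(r_sq_adj)))
```

Because the script carries full precision rather than rounding at each
step, the last digit may differ slightly from the hand computation.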


Thus in this analysis, 87 per cent of the total variance in the sample
$X_1$'s is "explained" by the regression equation
$X_1 = -5.56 + 1.98 X_2$.

The fact that r = 0.94 seems rather impressive. This represents a
rather "high" correlation. But it is easy to be misled by high
correlation coefficients. So much so, that in recent years there has
been a trend away from the former emphasis on correlation analysis per
se to regression analysis, which stresses the derivation of
structurally sound estimating relationships and of measures of the
confidence that the user might have in the estimating equations.

In our illustrative example, it will be recalled, the measures of the
unreliability of the estimating equation seemed rather high. The
standard error of estimate in relation to the mean of the sample
$X_1$'s is high, and the confidence bands around the estimating
equation would seem to be rather wide, at least for certain
applications. Yet the correlation coefficient turns out to be high,
indicating a close "degree of association" between $X_1$ and $X_2$.
This leads to an interesting question: How can the measure of
correlation be so favorable and yet at the same time the measures of
unreliability of the estimating equation tend to be unfavorable?
Intuitively one can see why this might occur. The correlation
coefficient is in a sense a measure of the average degree of
association between the variables. However, it is conceivable that in
an average sense the degree of association might be quite strong, but
at the same time the dispersion or "spread" around the average may be
fairly wide, thus leading to considerable uncertainty or unreliability
of the estimating relationship.

This intuitive explanation appears plausible, but let us see if we can
be more definitive. In order to do this, recall the previous discussion
of analysis of variance. We have:

    $\sigma_u^2$ = unexplained variance ($S^2$)
    $\sigma_e^2$ = explained variance
    $\sigma_t^2$ = total variance

Now r is not an absolute quantity, but rather it is based on a ratio
of the explained to the total variance. To be precise:*

    $r = \sqrt{\sigma_e^2 / \sigma_t^2}$

On the other hand, S is an absolute quantity: namely,

    $S = \sqrt{\sigma_u^2}$

Here we have the key to the explanation that is being sought. Not
infrequently the sample may be structured in such a way that the total
variance ($\sigma_t^2$) will tend to be large.** Now even if the
explained variance ($\sigma_e^2$) represents a high fraction of
$\sigma_t^2$, and if $\sigma_t^2$ is large, there is still plenty of
room for $\sigma_u^2$ (an absolute quantity) to be large; hence S can
be large, especially in relation to the mean of the sample $X_1$'s. In
other words, the explanation hinges on the fact that r is based on a
ratio while S is based on an absolute quantity; and that if total
variance is large, S can still be large even though the quantity upon
which it is based ($S^2$) represents a declining proportion of total
variance as $r^2 = \sigma_e^2 / \sigma_t^2$ increases.***

The relationship between the standard error of estimate and the
correlation coefficient may perhaps be seen more clearly from the
following:

By definition,

    $\sigma_t^2 = \sigma_e^2 + \sigma_u^2$

Then, dividing through by $\sigma_t^2$, we have:

    $1 = \frac{\sigma_e^2}{\sigma_t^2} + \frac{\sigma_u^2}{\sigma_t^2} = r^2 + \frac{S^2}{\sigma_t^2}$    (Recall $S^2 = \sigma_u^2$.)

To get S we may start with

    $\frac{S^2}{\sigma_t^2} = 1 - r^2$

and derive

    $S^2 = \sigma_t^2 (1 - r^2)$

    $S = \sigma_t \sqrt{1 - r^2}$

* For the sake of simplicity, in the discussion to follow we shall use
the unadjusted S and r. This in no way affects the main line of
argument.

** One circumstance that can lead to a large $\sigma_t^2$ is unequal
distribution of observations within the sample. We shall illustrate
this point later.

*** Note that r is based on $r^2 = \sigma_e^2 / \sigma_t^2$ and that
if $r^2$ is a large fraction, r will be even larger since it is the
square root of a number between zero and unity.


Returning to our illustrative example, the numerical values of the
variances are:*

                                       Amount     Fraction of Total
    Unexplained ($\sigma_u^2$)          536.1           0.12
    Explained ($\sigma_e^2$)          4,104.3           0.88
    Total ($\sigma_t^2$)              4,640.4           1.00
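The three variances can also be computed directly from the
observations, which makes the decomposition easy to verify. The sketch
below is illustrative only. It assumes the sample values of $X_1$ and
$X_2$ listed in the code, which we recovered by taking anti-logs of the
entries in Table 3 (their sums agree with the totals quoted from
Table 2); the variable names are our own.

```python
import math

# Assumed sample data (anti-logs of the Table 3 entries).
X1 = [8, 15, 20, 40, 30, 35, 70, 50, 265, 110, 85, 60, 20, 165]  # tooling cost
X2 = [7, 8, 9, 15, 12, 20, 25, 40, 115, 50, 70, 50, 20, 90]      # airframe weight
ALPHA, BETA = -5.5574, 1.9789  # regression coefficients quoted in the text

n = len(X1)
mean_x1 = sum(X1) / n
computed = [ALPHA + BETA * x for x in X2]  # computed values of X1

var_total = sum((y - mean_x1) ** 2 for y in X1) / n
var_explained = sum((yc - mean_x1) ** 2 for yc in computed) / n
var_unexplained = sum((y - yc) ** 2 for y, yc in zip(X1, computed)) / n

r_sq = var_explained / var_total
s_unadjusted = math.sqrt(var_unexplained)

print("total = %.1f, explained = %.1f, unexplained = %.1f"
      % (var_total, var_explained, var_unexplained))
print("r^2 = %.2f, S (unadjusted) = %.1f" % (r_sq, s_unadjusted))
# S can equally be obtained as sigma_t * sqrt(1 - r^2).
print("sigma_t * sqrt(1 - r^2) = %.1f" % math.sqrt(var_total * (1 - r_sq)))
```

Because the coefficients are rounded to four decimals, the explained
and unexplained parts add up to the total only to within rounding.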

Let us examine these statistics. First, considering the total
variance, it is not immediately obvious whether $\sigma_t^2$ is large
or small. We can, however, readily determine that it is large. Taking
the square root of $\sigma_t^2 = 4,640.4$, we obtain the standard
deviation of the sample $X_1$'s. This turns out to be
$\sigma_t = 68.12$. The mean of the $X_1$'s, it will be recalled, is
$\bar{X}_1 = 69.5$. Therefore, in this case the standard deviation is
essentially equal to the mean, a situation that clearly indicates a
wide dispersion of the variable under consideration.** What is the
reason for the large standard deviation of the $X_1$'s in our
illustrative example? The answer is readily apparent from Fig. 7. Here
the sample observations on the scatter diagram are shown as deviations
from the mean of the $X_1$'s ($\bar{X}_1 = 69.5$). Note the uneven
distribution of observations in the sample, with a few very large
values on the high end. When the extremely large deviations of these
few observations are squared, as they must be in the calculation of
the variance of the $X_1$'s, they have a magnified impact on the
determination of $\sigma_t^2$. This is why it is desirable to have a
more uniformly distributed sample; but often this is not possible, and
we have to take what we can get. Incidentally, the uneven distribution
of observations portrayed in Fig. 7, while having an "unfavorable"
effect on $\sigma_t^2$, does not necessarily lower $\sigma_e^2$ and
hence the correlation coefficient (r). Referring to the highest
observation in Fig. 7, for example, the deviation of the observation
from $\bar{X}_1$ (measured by the vertical line) contributes heavily to
the magnitude of $\sigma_t^2$. However, the line AB (measuring the
deviation of the computed $X_1$ from $\bar{X}_1$) enters into the
calculation of explained variance ($\sigma_e^2$) and in effect
"explains" most of the total variation. (Note again the "magnified"
effects due to squaring these deviations.) In such cases, the
correlation tends to become somewhat spurious, and because of this we
should concentrate on regression analysis rather than correlation
analysis.

* Again, for simplicity we shall use unadjusted values of S and r.

** Recall that if a variable is normally distributed, an interval
defined by the mean ± 1 standard deviation would include about 68 per
cent of the cases.

Returning to the numerical values of the variances, it is clear that
$\sigma_e^2$ is very large, due in part to the "spuriousness" referred
to earlier. In fact $\sigma_e^2$ "explains" 88 per cent of the total
variance. Hence $r^2 = 0.88$, and r = 0.94. But as indicated
previously, this is misleading. For a more complete picture, we must
turn to $\sigma_u^2$. Here the unexplained variance is only 12 per cent
of total variance. However, in absolute terms it turns out to be fairly
large: $\sigma_u^2 = 536$, which leads to $S = \sqrt{536} = 23$. The
value of S per se does not mean very much. For perspective, we
therefore relate it to the mean of the sample $X_1$'s and obtain
$S / \bar{X}_1 = 23/69.5 = 0.33$. A ratio this high does indeed
indicate that S is quite large. Thus, in our numerical example we have
a good illustration of the points made earlier in the general
discussion: a large $\sigma_e^2$ accounting for a large fraction of
$\sigma_t^2$ (and hence a high correlation); and $\sigma_u^2$
accounting for a small fraction of $\sigma_t^2$ but at the same time
being large in relation to the mean of the sample $X_1$'s.

To illustrate these points still further, suppose that in our
numerical example $r^2$ were even higher, say 0.95 (implying
r = 0.97). The variance table would then be:

                                       Amount     Fraction of Total
    Unexplained ($\sigma_u^2$)          232.0           0.05
    Explained ($\sigma_e^2$)          4,408.4           0.95
    Total ($\sigma_t^2$)              4,640.4           1.00

Here, $S = \sqrt{232} = 15.2$, and $S / \bar{X}_1 = 15.2/69.5 = 0.22$.
Thus even if $\sigma_u^2$ accounted for only 5 per cent of
$\sigma_t^2$, the ratio $S / \bar{X}_1$ would be 0.22 -- still fairly
high.*
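The same "what if" calculation can be repeated for any value of $r^2$
by using the relation $S = \sigma_t \sqrt{1 - r^2}$ derived earlier.
The sketch below is illustrative; the function name is our own, and the
total variance and mean are the values quoted in the text.

```python
import math

VAR_TOTAL = 4640.4   # total variance of the sample X1's (from the text)
MEAN_X1 = 69.5       # mean of the sample X1's (from the text)


def std_error_from_r_squared(var_total, r_sq):
    """Unadjusted standard error of estimate implied by a given r squared."""
    return math.sqrt(var_total * (1.0 - r_sq))


for r_sq in (0.8845, 0.95, 0.99):
    s = std_error_from_r_squared(VAR_TOTAL, r_sq)
    print("r^2 = %.4f  r = %.3f  S = %5.1f  S/mean = %.2f"
          % (r_sq, math.sqrt(r_sq), s, s / MEAN_X1))

s_95 = std_error_from_r_squared(VAR_TOTAL, 0.95)
```

The point of the loop is that S shrinks only with the square root of
the unexplained fraction, so a large total variance keeps S large even
when r is very close to unity.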

* At this point the students will be required to do a simple linear
regression analysis. (See Appendix B.)

IV.  A CURVILINEAR ANALYSIS: LOGARITHMIC REGRESSION

Up to this point the analysis has been confined to simple linear
regression. While a first examination of the scatter diagram of $X_1$
vs. $X_2$ indicates that a linear relationship might be as good as
anything else, it still cannot be concluded definitely that some type
of non-linear relationship might not be preferable. We shall now
explore this possibility.

One type of non-linear relationship that is very frequently used is of
the form:

(15)  $X_1 = \alpha X_2^{\beta}$

Equation (15), however, is difficult to deal with statistically; so
usually we make a logarithmic transformation of the variables,
obtaining an equation which is linear in the logarithms of the
variables.

(16)  $\log X_1 = \log \alpha + \beta \log X_2$

The procedure here is to conduct the statistical analysis in terms of
the logarithms of the variables, obtaining estimates of $\log \alpha$
and $\beta$ from a least squares fit of equation (16) and then
transforming back to the original data and to equation (15). This
approach has several advantages, the most important ones being:

(1)  We can proceed almost identically to the simple linear
     regression case.

(2)  No additional degrees of freedom are lost,* an important
     consideration when the sample size is small.

The first step is to take the original data for $X_1$ and $X_2$
contained in Table 2 and convert these data to logarithms. The cross
product and the squares are then computed, and the summations are
derived. The results of these calculations are presented in Table 3.
We can now proceed to a simple linear regression analysis of the data
in logarithmic form. This means that a linear regression equation is
derived, such that the sum of squares of the deviations of the
logarithms of the variables around the regression line is at a minimum.

The "normal equations" for estimating the regression coefficients are
of the same form as before:

(17)  $\sum \log X_1 = N \log \alpha + \beta \sum \log X_2$

(18)  $\sum [(\log X_1)(\log X_2)] = \log \alpha \sum \log X_2 + \beta \sum (\log X_2)^2$

Substituting the required summations contained in Table 3 into
equations (17) and (18), we obtain:

(19)  $23.2383 = 14 \log \alpha + 19.8177 \beta$

(20)  $34.9241 = 19.8177 \log \alpha + 30.1372 \beta$

* Recall that in a regression analysis, degrees of freedom means the
sample size minus the number of parameters in the estimating equation.
Since in logarithmic regression there are only two parameters in the
estimating equation, the number of degrees of freedom is the same as
for simple linear regression: N - 2.

** See Croxton and Cowden, op. cit., pp. 69[illegible]-99.

Table 3

DATA FOR LOG-LINEAR REGRESSION ANALYSIS OF INITIAL TOOLING COST
AND AIRFRAME WEIGHT

Aircraft
  Type     Log X1    Log X2   (Log X1)(Log X2)   (Log X1)^2   (Log X2)^2
  F-1      0.9031    0.8451        0.7632          0.8156       0.7142
  F-2      1.1761    0.9031        1.0621          1.3832       0.8156
  F-3      1.3010    0.9542        1.2414          1.6926       0.9105
  F-4      1.6021    1.1761        1.8842          2.5667       1.3832
  F-5      1.4771    1.0792        1.5941          2.1818       1.1647
  F-6      1.5441    1.3010        2.0089          2.3842       1.6926
  F-7      1.8451    1.3979        2.5793          3.4044       1.9541
  B-1      1.6990    1.6021        2.7220          2.8866       2.5667
  B-2      2.4232    2.0607        4.9935          5.8719       4.2465
  B-3      2.0414    1.6990        3.4683          4.1673       2.8866
  B-4      1.9294    1.8451        3.5599          3.7226       3.4044
  B-5      1.7782    1.6990        3.0212          3.1620       2.8866
  B-6      1.3010    1.3010        1.6926          1.6926       1.6926
  B-7      2.2175    1.9542        4.3334          4.9173       3.8189
  Totals  23.2383   19.8177       34.9241         40.8488      30.1372

SOURCE: $X_1$ and $X_2$ data contained in Table 2, converted to
logarithms.

    $\frac{\sum \log X_1}{N} = \frac{23.2383}{14} = 1.6599$

    $\frac{\sum \log X_2}{N} = \frac{19.8177}{14} = 1.4156$

Solving equations (19) and (20) simultaneously (using the same
procedure as before), the estimates of $\log \alpha$ and $\beta$ are
found to be:

    $\log \alpha = 0.281824$

    $\beta = 0.973516$*

The regression equation for the logarithms of the variables is,
therefore:

(21)  $\log X_1 = 0.2818 + 0.9735 \log X_2$
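Equations (19) and (20) are two linear equations in two unknowns, so
they can be solved in closed form. The following is a minimal sketch,
assuming only the Table 3 totals; the function name is our own.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve the two least-squares normal equations
         sum_y  = n * a     + b * sum_x
         sum_xy = a * sum_x + b * sum_xx
    for the intercept a and the slope b."""
    det = n * sum_xx - sum_x ** 2
    slope = (n * sum_xy - sum_x * sum_y) / det
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


# Totals from Table 3: x = log X2, y = log X1 (common logarithms).
log_alpha, beta = solve_normal_equations(
    n=14, sum_x=19.8177, sum_y=23.2383, sum_xy=34.9241, sum_xx=30.1372)
alpha = 10 ** log_alpha  # anti-log, giving the multiplier in equation (15)

print("log alpha = %.4f, beta = %.4f, alpha = %.4f" % (log_alpha, beta, alpha))
```

The results agree with the hand calculation to about three decimal
places; the small remaining differences come from rounding in the
intermediate steps of the manual elimination.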

Equation (21) is plotted on the scatter diagram contained in Fig. 8-A
(the solid line). Note that here the original values (arithmetic form)
of $X_1$ and $X_2$ are plotted on a chart having logarithmic scales on
both axes (a "log-log" chart). This is exactly equivalent to plotting
the logarithms of the variables on an arithmetic chart. (See Fig. 8-B.)

The standard error of estimate is computed as before:

    $S_{\log}^2 = \frac{\sum (\log X_1)^2 - \log \alpha \sum \log X_1 - \beta \sum [(\log X_1)(\log X_2)]}{N - 2}$

        $= \frac{40.8488 - (0.281824)(23.2383) - (0.973516)(34.9241)}{14 - 2}$

        $= \frac{40.8488 - 6.5491 - 33.9992}{12}$

        $= \frac{0.3005}{12} = 0.02504$

    $S_{\log} = \sqrt{0.02504} = 0.1582$

* Substituting these values for $\log \alpha$ and $\beta$ in equation
(19), we obtain a check as follows:

    23.2383 = (14)(0.281824) + (19.8177)(0.973516)
    23.2383 = 3.9455 + 19.2928
    23.2383 = 23.2383.

In Figs. 8-A and 8-B, the dashed lines indicate a band representing
± 1 $S_{\log}$ around the regression line
$\log X_1 = 0.282 + 0.974 \log X_2$.

For perspective, the value of $S_{\log}$ may be related to the mean of
the $\log X_1$'s in the sample:

    $\frac{S_{\log}}{\sum \log X_1 / N} = \frac{0.1582}{1.6599} = 0.095$

At this point it would appear that things have improved markedly over
the simple linear regression case. The picture portrayed in Figs. 8-A
and 8-B suggests an excellent "fit" to the data. Also, the standard
error of estimate in relation to the mean of the $\log X_1$'s is
substantially lower than in the simple linear regression example:
10 per cent as compared with 36 per cent.

But this is not the whole story. Up to now the analysis has dealt with
the logarithms of the data. The analyst, however, is not interested in
estimating $\log X_1$ for a given value of $\log X_2$; rather he is
interested in making estimates in terms of the original data. We
therefore have to transform the logarithmic analysis back to an
arithmetic form. When this transformation is made, the log-linear
estimating equation

    $\log X_1 = 0.2818 + 0.9735 \log X_2$

becomes

(22)  $X_1 = 1.9135 X_2^{0.9735}$

where 1.9135 is the anti-log of $\log \alpha = 0.2818$. Equation (22)
is plotted on the scatter diagram contained in Fig. 9 (the solid
line). It should be noted that in this case equation (22) plots as a
straight line over the range of $X_2$ shown in Fig. 9. Since the
exponent of $X_2$ is so close to unity, the curvilinearity implied by
the form of (22) does not show up. For all practical purposes,
equation (22) plots $X_1$ as a linear homogeneous function of $X_2$.
Note also that the regression line does not appear to be a
particularly good "fit" to the original data, certainly no better than
the simple linear estimating equation obtained previously.

To gain further insight, let us turn to the standard error of estimate
and compute a ± 1 S band about the regression line. Again, we must
transform the logarithmic analysis into an arithmetic one. This may be
done in two ways. One way is to carry through the computation in terms
of logarithms and convert to the original data at the very end. As an
illustration, assume that the analyst wants to compute an estimate of
$X_1$ for $X_2 = 100$. The logarithm of 100 is 2. We have, then:

    $\log X_1 = 0.2818 + 0.9735(2)$
              $= 0.2818 + 1.9470$
              $= 2.2288$

    $\log X_1 \pm S_{\log} = 2.2288 \pm 0.1582 = 2.0706$ and $2.3870$.

These latter two numbers are converted into arithmetic terms by taking
the anti-logarithms:

    anti-log 2.3870 = 243.8
    anti-log 2.0706 = 117.7

The ± S interval for $X_2 = 100$ is, therefore, 118 to 244. (See
points A and B in Fig. 9.)

[Fig. 9: scatter diagram of the original data with the regression line
of equation (22) and the ± S bands]

Another approach is to convert from a logarithmic to an arithmetic
approach immediately. In the previous method we had
$\log X_1 \pm S_{\log}$; but in terms of the original data, this
transforms into (anti-log of $\log X_1$) × (anti-log of $S_{\log}$)
for the "plus" case, and (anti-log of $\log X_1$) ÷ (anti-log of
$S_{\log}$) for the "minus" case.* First we need the anti-log of
$S_{\log} = 0.1582$, which is 1.4397. In the previous example we found
that for $X_2 = 100$, $\log X_1 = 2.2288$. The anti-log of 2.2288
gives us $X_1 = 169.34$. The ± S interval is, therefore:

    (169.34)(1.4397) = 243.8

    (169.34)(1/1.4397) = 117.6,

which is the same as that obtained by the first method.

As another example, let us obtain the ± S interval for $X_2 = 30$.
From the regression line in Fig. 9, we read off $X_1 \approx 51$ when
$X_2 = 30$. The interval is:

    (51)(1.4397) = 73

    (51)(1/1.4397) = 35,

which when plotted gives points C and D in Fig. 9. Connecting points A
and C for the lower bound and points B and D for the upper bound, we
obtain the ± S interval around the regression equation
$X_1 = 1.913 X_2^{0.974}$.

* Recall that addition of logarithms is equivalent to multiplication
in arithmetic terms, and subtraction of logarithms is equivalent to
division in arithmetic terms.

We now have a much different picture than that indicated in Figs. 8-A
and 8-B for the logarithmic analysis. In Fig. 9 the ± S interval is an
ever-widening one defined in terms of linear homogeneous functions of
$X_2$, with slope 1.44 for the upper bound and slope 1/1.44 = 0.69 for
the lower bound. (See the heavy dashed lines in Fig. 9.)
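The second method lends itself to a small routine that returns the
estimate together with its multiplicative ± S bounds. This is a sketch
under the assumption that the rounded coefficients of equation (21)
and $S_{\log} = 0.1582$ are used; the function name is our own.

```python
import math

LOG_ALPHA = 0.2818   # intercept of equation (21), common logarithms
BETA = 0.9735        # slope of equation (21)
S_LOG = 0.1582       # standard error of estimate in logarithmic terms


def estimate_with_band(x2):
    """Return (lower, estimate, upper) for X1 in arithmetic terms.

    Adding or subtracting S_LOG in logarithms is the same as multiplying
    or dividing the arithmetic estimate by the anti-log of S_LOG."""
    estimate = 10 ** (LOG_ALPHA + BETA * math.log10(x2))
    factor = 10 ** S_LOG
    return estimate / factor, estimate, estimate * factor


for weight in (30, 100):
    lower, estimate, upper = estimate_with_band(weight)
    print("X2 = %3d: estimate %.1f, interval %.1f to %.1f"
          % (weight, estimate, lower, upper))

lower_100, estimate_100, upper_100 = estimate_with_band(100)
```

For $X_2 = 30$ the computed estimate is slightly higher than the value
of about 51 read off the chart, as one would expect from a graphical
reading.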

Recall that in our simple linear regression analysis in Section III,
S = 25. If we lay off ± 25 around the regression line
$X_1 = 1.913 X_2^{0.974}$, the results are the light dashed lines in
Fig. 9. Here it is interesting to note that at approximately the mean
of the sample $X_2$'s ($\bar{X}_2 = 38$) the two ± S intervals are the
same width. For ranges of $X_2 < 38$, the ± 25 interval is the larger;
and for $X_2 > 38$, the ± 25 interval is the smaller of the two. Thus,
while for the very low range of values for $X_2$ we might prefer using
$X_1 = 1.913 X_2^{0.974}$ as an estimating equation, we would not
prefer it over the simple linear regression equation for the majority
of the range of $X_2$ in the sample. We conclude, therefore, that in
general $X_1 = 1.913 X_2^{0.974}$ offers no improvement over
$X_1 = -5.56 + 1.98 X_2$.
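The weight at which the two intervals have the same width can be found
directly, since the width of the logarithmic band is proportional to
the estimate itself. The sketch below is illustrative; it assumes the
rounded constants of equation (22), $S_{\log} = 0.1582$, and S = 25 for
the linear case, and the function names are our own.

```python
ALPHA, BETA = 1.9135, 0.9735   # equation (22)
S_LOG = 0.1582                 # standard error in logarithmic terms
S_LINEAR = 25.0                # standard error from the simple linear case


def log_band_width(x2):
    """Full width of the +/- S band around equation (22) at a given X2."""
    estimate = ALPHA * x2 ** BETA
    factor = 10 ** S_LOG
    return estimate * factor - estimate / factor


def crossover_weight():
    """X2 at which the log band is exactly as wide as the +/- 25 band."""
    factor = 10 ** S_LOG
    estimate = 2 * S_LINEAR / (factor - 1 / factor)
    return (estimate / ALPHA) ** (1 / BETA)


x2_equal = crossover_weight()
print("bands are equally wide near X2 = %.1f" % x2_equal)
for weight in (10, 20, 50, 100):
    print("X2 = %3d: log band width %.1f versus %.1f"
          % (weight, log_band_width(weight), 2 * S_LINEAR))
```

Below the crossover the logarithmic band is the narrower of the two;
above it the band keeps widening in proportion to the estimate.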

The logarithmic example contained in this section illustrates a point
that is often forgotten. A logarithmic transformation of the variables
has a tendency to compress and shape the original data in such a way
that a statistical fit to the logarithms "looks good." However, as
pointed out previously, we are not interested in estimating the
logarithms. Very often when the logarithmic analysis is transformed
back into terms of the original data, the results do not appear so
impressive, as was the case in our example. In sum, logarithmic
transformations can be tricky and misleading. We must be cautious when
using them.

V.  A CURVILINEAR ANALYSIS: SECOND DEGREE EQUATION

We have just seen that for our illustrative example a logarithmic
regression does not seem to offer any improvement over the simple
linear regression case. In this section another type of curvilinear
regression analysis will be attempted. Here, a second degree equation
will be used.

The estimating equation is of the form:

(23)  $X_1 = \alpha + \beta_1 X_2 + \beta_2 X_2^2$

In this case three parameters must be estimated: $\alpha$, $\beta_1$,
and $\beta_2$.* Instead of two "normal equations" we now must have
three. They are:**

(24)  $\sum X_1 = \alpha N + \beta_1 \sum X_2 + \beta_2 \sum X_2^2$

(25)  $\sum X_1 X_2 = \alpha \sum X_2 + \beta_1 \sum X_2^2 + \beta_2 \sum X_2^3$

(26)  $\sum X_1 X_2^2 = \alpha \sum X_2^2 + \beta_1 \sum X_2^3 + \beta_2 \sum X_2^4$

Most of the summation data required for these equations are contained
in Table 2: namely,

    $\sum X_1 = 973$
    $\sum X_2 = 531$
    $\sum X_1 X_2 = 65,941$
    $\sum X_2^2 = 34,813$
    $\sum X_1^2 = 132,589$

However, the following supplementary data are needed:
$\sum X_1 X_2^2$, $\sum X_2^3$, and $\sum X_2^4$. The data are
calculated and presented in Table 4.

Substituting these summations into equations (24), (25), and (26):

(27)  $973 = 14\alpha + 531\beta_1 + 34,813\beta_2$

(28)  $65,941 = 531\alpha + 34,813\beta_1 + 2,945,187\beta_2$

(29)  $5,844,667 = 34,813\alpha + 2,945,187\beta_1 + 280,375,669\beta_2$

These equations may be solved simultaneously by repeated use of the
same technique that was used in Section III. Here, we shall take (27)
and (28) together and eliminate $\alpha$; do the same thing for (28)
and (29); take the resulting two equations in $\beta_1$ and $\beta_2$
and eliminate $\beta_1$; solve the result for $\beta_2$; and then
substitute back in previous equations to get $\alpha$ and $\beta_1$.

* Notice that by adding the variable $X_2^2$, an additional degree of
freedom is lost. In the simple linear regression case, degrees of
freedom were N - 2 = 14 - 2 = 12. Now we have N - 3 = 14 - 3 = 11
degrees of freedom.

** See Croxton and Cowden, op. cit., p. 706.

Table 4

SUPPLEMENTARY DATA NEEDED FOR SECOND DEGREE REGRESSION ANALYSIS

Observation    $X_1 X_2^2$     $X_2^3$        $X_2^4$
  F-1                  392          343           2,401
  F-2                  960          512           4,096
  F-3                1,620          729           6,561
  F-4                9,000        3,375          50,625
  F-5                4,320        1,728          20,736
  F-6               14,000        8,000         160,000
  F-7               43,750       15,625         390,625
  B-1               80,000       64,000       2,560,000
  B-2            3,504,625    1,520,875     174,900,625
  B-3              275,000      125,000       6,250,000
  B-4              416,500      343,000      24,010,000
  B-5              150,000      125,000       6,250,000
  B-6                8,000        8,000         160,000
  B-7            1,336,500      729,000      65,610,000
  Total          5,844,667    2,945,187     280,375,669

Following this procedure, the calculations are as follows:

Ratio of the coefficients of $\alpha$ in equations (27) and (28)
= 531/14 = 37.928571.

Multiplying equation (27) by 37.928571 and subtracting the resulting
equation from (28):

        $65,941 = 531\alpha + 34,813\beta_1 + 2,945,187\beta_2$
        $36,905 = 531\alpha + 20,140\beta_1 + 1,320,407\beta_2$

(30)    $29,036 = 14,673\beta_1 + 1,624,780\beta_2$

Ratio of the coefficients of $\alpha$ in equations (28) and (29)
= 34,813/531 = 65.561205.

Multiplying equation (28) by 65.561205 and subtracting the result from
equation (29):

        $5,844,667 = 34,813\alpha + 2,945,187\beta_1 + 280,375,669\beta_2$
        $4,323,171 = 34,813\alpha + 2,282,382\beta_1 + 193,090,009\beta_2$

(31)    $1,521,496 = 662,805\beta_1 + 87,285,660\beta_2$

Now taking equations (30) and (31), eliminate $\beta_1$ by multiplying
equation (30) by 662,805/14,673 = 45.171744 and subtracting the result
from equation (31):

        $1,521,496 = 662,805\beta_1 + 87,285,660\beta_2$
        $1,311,607 = 662,805\beta_1 + 73,394,146\beta_2$

          $209,889 = 13,891,514\beta_2$

        $\beta_2 = \frac{209,889}{13,891,514} = 0.015109$

Substituting $\beta_2 = 0.015109$ in equation (30) and solving for
$\beta_1$:

        $29,036 = 14,673\beta_1 + (1,624,780)(0.015109)$
        $29,036 = 14,673\beta_1 + 24,549$
        $14,673\beta_1 = 4,487$
        $\beta_1 = 0.305800$

Substituting $\beta_2 = 0.015109$ and $\beta_1 = 0.305800$ in equation
(27) and solving for $\alpha$:

        $973 = 14\alpha + (0.305800)(531) + (0.015109)(34,813)$
        $973 = 14\alpha + 162 + 526$
        $14\alpha = 285$
        $\alpha = 20.357143$

Checking the computations by substituting the derived values of
$\alpha$, $\beta_1$, and $\beta_2$ in equation (28):

        $65,941 = (20.357143)(531) + (0.305800)(34,813) + (0.015109)(2,945,187)$
        $65,941 = 10,810 + 10,646 + 44,499$
        $65,941 \approx 65,955$
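The elimination procedure used above is ordinary Gaussian elimination,
and it can be mechanized for any number of normal equations. The
following is a minimal sketch in plain Python, assuming only the
summations of equations (27) to (29); the function name is our own.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with
    partial pivoting, followed by back-substitution."""
    n = len(rhs)
    rows = [[float(v) for v in row] + [float(b)]
            for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Eliminate this column's unknown from the rows below.
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution


# Coefficients of alpha, beta1, beta2 in equations (27), (28), (29).
normal_matrix = [
    [14, 531, 34_813],
    [531, 34_813, 2_945_187],
    [34_813, 2_945_187, 280_375_669],
]
normal_rhs = [973, 65_941, 5_844_667]

alpha, beta1, beta2 = solve_linear_system(normal_matrix, normal_rhs)
print("alpha = %.4f, beta1 = %.4f, beta2 = %.6f" % (alpha, beta1, beta2))
```

A full-precision solution may differ a little from the hand-computed
values, because the manual elimination rounds each intermediate
equation to whole numbers; the check against equation (28) above shows
a discrepancy of the same origin.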

Taking the estimates of $\alpha$, $\beta_1$, and $\beta_2$ derived
above, the estimating equation becomes:

(32)  $X_1 = 20.36 + 0.3058 X_2 + 0.0151 X_2^2$

This equation is plotted on the scatter diagram contained in Fig. 10
(the solid line).

Fig. 10  Initial tooling cost versus airframe weight
(2nd degree case)

The standard error of estimate is calculated as before, except here we
have to add a term for $\beta_2$ and take into account the loss of the
additional degree of freedom. The formula is:

(33)  $S^2 = \frac{\sum X_1^2 - (\alpha \sum X_1 + \beta_1 \sum X_1 X_2 + \beta_2 \sum X_1 X_2^2)}{N - 3}$

          $= \frac{132,589 - (20.3571)(973) - (0.3058)(65,941) - (0.0151)(5,844,667)}{14 - 3}$

          $= \frac{132,589 - 19,807 - 20,165 - 88,254}{11}$

          $= \frac{4,363}{11} = 396.64$

      $S = \sqrt{396.64} = 19.92$ (millions of dollars)

Relating S = 19.92 to the mean of the sample $X_1$'s:

      $\frac{S}{\bar{X}_1} = \frac{19.92}{69.5} = 0.29$

An area bounded by ± 1 S around the regression line is presented in
Fig. 10 (the dashed lines).
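The shortcut formula for the standard error has the same shape for the
simple linear and the second degree cases; only the number of terms
and the degrees of freedom change. The sketch below is illustrative
and uses the rounded coefficients exactly as they appear in the text;
the function name is our own.

```python
import math


def standard_error(sum_y_sq, coefficients, cross_sums, n):
    """Standard error of estimate adjusted for degrees of freedom.

    coefficients: estimated parameters (constant term first)
    cross_sums:   the matching summations (sum of X1, sum of X1*X2, ...)
    """
    explained = sum(c * s for c, s in zip(coefficients, cross_sums))
    degrees_of_freedom = n - len(coefficients)
    return math.sqrt((sum_y_sq - explained) / degrees_of_freedom)


# Second degree case, equation (33).
s_second = standard_error(
    132_589, [20.3571, 0.3058, 0.0151], [973, 65_941, 5_844_667], n=14)

# Simple linear case, for comparison.
s_linear = standard_error(132_589, [-5.5574, 1.9789], [973, 65_941], n=14)

print("second degree: S = %.2f, S/mean = %.2f" % (s_second, s_second / 69.5))
print("simple linear: S = %.2f, S/mean = %.2f" % (s_linear, s_linear / 69.5))
```

Note that the numerator is a small difference between large numbers,
so the result is sensitive to how many decimals of the coefficients
are carried.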

As in the simple linear regression case, a prediction interval may be
calculated for a value of $X_1$ obtained from the estimating equation
for specified values of $X_2$ and $X_2^2$. For a second degree
regression, however, the calculation is somewhat more complicated.
Since the computational procedure required here is the same as that
for multiple regression analysis, we shall defer the subject of
prediction intervals until the following section (Section VI) on
multiple regression analysis.

We now turn to calculation of the measures of correlation. In curvilinear analysis, the coefficient of curvilinear correlation is usually referred to as the index of correlation and is denoted by the symbol $\rho$. $\rho^2$ is called the index of determination. The formula for $\rho^2$ is:

(34)  $\rho^2 = \dfrac{\Sigma X_{1c}^2 - \bar{X}_1\Sigma X_1}{\Sigma X_1^2 - \bar{X}_1\Sigma X_1}$

where

(35)  $\Sigma X_{1c}^2 = \alpha\Sigma X_1 + \beta\Sigma X_1X_2 + \gamma\Sigma X_1X_2^2$

and $\alpha$, $\beta$, and $\gamma$ are the constant term and the coefficients of $X_2$ and $X_2^2$ in the estimating equation.

Equation (34) gives the unadjusted $\rho^2$. Particularly in the case of small samples, $\rho^2$ should be adjusted for degrees of freedom. The following formula may be used for this purpose:

(36)  $\bar{\rho}^2 = \dfrac{\rho^2(N - 1) - (m - 1)}{N - m}$,

where m is the number of coefficients in the regression equation. (m = 3 in the case of second degree regression.)

Substituting the required data in equations (34) and (35), we have:

$\Sigma X_{1c}^2 = (20.3571)(973) + (0.3058)(65{,}941) + (0.0151)(5{,}844{,}667)$

$= 19{,}807 + 20{,}165 + 88{,}254 = 128{,}226$

$\rho^2 = \dfrac{128{,}226 - (69.50)(973)}{132{,}589 - (69.50)(973)} = \dfrac{128{,}226 - 67{,}624}{132{,}589 - 67{,}624} = \dfrac{60{,}602}{64{,}965} = 0.9328$

$\bar{\rho}^2 = \dfrac{(0.9328)(14 - 1) - (3 - 1)}{14 - 3} = \dfrac{12.1264 - 2}{11} = \dfrac{10.1264}{11} = 0.9206$

$\bar{\rho} = \sqrt{0.9206} = 0.9595$.
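As a cross-check on equations (34) and (36), the two steps can be written as small functions. The function names below are hypothetical and chosen only for this sketch; the inputs are the rounded sums used in the text.

```python
import math

def index_of_determination(explained_raw, total_raw, mean_x1, sum_x1):
    """Equation (34): both sums are corrected by mean(X1) * sum(X1)."""
    correction = mean_x1 * sum_x1
    return (explained_raw - correction) / (total_raw - correction)

def adjust_for_df(r2, n, m):
    """Equation (36): adjust for n observations and m fitted coefficients."""
    return (r2 * (n - 1) - (m - 1)) / (n - m)

rho2 = index_of_determination(128_226, 132_589, 69.50, 973)
rho2_adj = adjust_for_df(rho2, n=14, m=3)
rho_adj = math.sqrt(rho2_adj)

print(f"rho^2 = {rho2:.4f}, adjusted = {rho2_adj:.4f}, rho = {rho_adj:.4f}")
```

The same `adjust_for_df` step applies unchanged to the multiple regression case, since only N and m enter the adjustment.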


We may now compare the results of the statistical analysis for the second degree regression case with those obtained for the simple linear regression example:*

                                        Simple Linear    Second Degree
                                        Regression       Regression

Standard error of estimate              $25 (million)    $20 (million)
Coefficient of variation (S/mean X1)    0.36             0.29
Coefficient (index) of determination    0.87             0.92
Coefficient (index) of correlation      0.94             0.96

From these data it would appear that the second degree regression offers a considerable improvement over the simple linear case. The standard error of estimate is reduced by $5 million, the coefficient of variation is lower by 7 percentage points, and the percentage of "explained" variation is higher by 5 percentage points. Also, the regression curve in Fig. 10 would appear to be a very good "fit" to the sample data.

*All measures included here are adjusted for degrees of freedom.
The real question, however, is whether the improvement is significant in a statistical sense. It must not be forgotten that in our illustrative examples we are dealing with a sample of data, and a very small sample indeed. It is conceivable, therefore, that the differences in the statistical measures presented above could be attributable purely to sampling error, and that in the "universe" or "population" we were attempting to describe, there is in reality no improvement in going from a linear to a second degree estimating equation. If this were so, then the "improvement" we have observed would not be regarded as "significant" in a statistical sense.

In order to resolve a question like this, the analyst must resort to a statistical testing procedure, a rather complex subject, the details of which are beyond the scope of the present discussion. Basically, what is involved in a statistical test is to set up the hypothesis that the observed differences are in effect non-existent, and then let the testing procedure indicate whether the hypothesis is accepted or rejected at some pre-specified level of probability. In a specific case, for example, we might have two statistical measures $x_1$ and $x_2$ with a difference of $\Delta x$. The statistical test might indicate that the chances are very small that two samples drawn from the assumed population would have statistical measures leading to a difference as large as $\Delta x$. In other words, it would seem highly unlikely that the observed difference could be attributable to sampling variation. If this were the case, we would conclude that the difference between $x_1$ and $x_2$ is significant, and that therefore the hypothesis that $x_1 = x_2$ is rejected.


In our present example, the author did conduct such a test.* Specifically, a statistical test was made to determine whether the difference in explained variation (0.87 for simple linear regression vs. 0.92 for second degree regression) is significant. Or, stating the problem another way, a test was made to determine whether the incremental increase in explained variance associated with the addition of the variable $X_2^2$ is significant. The results of the test indicate that the chances are small (less than 1 in 20 in this case) that the observed difference could be attributable to sampling error alone. We conclude, therefore, that the statistical results of the curvilinear regression are significantly better than those for the simple linear case.**

*For a discussion of the testing procedure used, see Croxton and Cowden, op. cit., pp. 710-12.

**At this point the students will be required to do a curvilinear regression analysis. (See Appendix C.)
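One common way to test whether an added variable contributes significantly is the incremental F-test sketched below. It is offered only as an illustration under stated assumptions: it is not necessarily the exact procedure given by Croxton and Cowden, the function name is hypothetical, the explained variation of the simple linear fit is rebuilt from the deviation sums quoted in Section VI (29,035 and 14,672), and the 5 per cent critical value 4.84 for 1 and 11 degrees of freedom is taken from standard F tables.

```python
def incremental_f(explained_small, explained_big, residual_big, n, m_big, added=1):
    """F ratio for the extra variation explained by `added` new variables."""
    gain_per_variable = (explained_big - explained_small) / added
    residual_variance = residual_big / (n - m_big)
    return gain_per_variable / residual_variance

correction = 69.50 * 973                       # mean(X1) * sum(X1)

# Simple linear fit of X1 on X2: explained variation about the mean.
explained_linear = 29_035 ** 2 / 14_672

# Second degree fit: explained and residual variation from Section V.
explained_second = 128_226 - correction
residual_second = 132_589 - 128_226

f_stat = incremental_f(explained_linear, explained_second,
                       residual_second, n=14, m_big=3)

F_CRIT_05 = 4.84   # tabulated F(1, 11) at the 0.05 level (assumed from tables)
significant = f_stat > F_CRIT_05
print(f"F = {f_stat:.2f}, significant at 5 per cent: {significant}")
```

Under these assumptions the ratio exceeds the tabulated value, which agrees with the conclusion stated in the text.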
VI. A MULTIPLE REGRESSION ANALYSIS

In Section V the simple linear regression example was extended by introducing the variable $X_2^2$ into the estimating equation, resulting in a curvilinear regression analysis. We shall now go back to the simple linear case and consider adding a new variable to the regression equation. This takes us into the realm of multiple regression analysis. Since here we shall introduce the new variable in a linear fashion, the analysis will represent one class of multivariate analysis: multiple linear regression.

The first question concerns which new variable to introduce into the regression equation. As indicated in Section II, in addition to $X_2$ (airframe weight), we have data on two other variables: $X_3$ (maximum speed) and $X_4$ (combat radius). At this point a technical consideration must be raised. The multiple regression model that is used in the analysis to follow postulates that the explanatory variables be non-correlated. We must therefore examine the relationship between $X_2$ and $X_3$, and $X_2$ and $X_4$. While there are statistical techniques for testing whether or not a significant correlation exists between two explanatory variables, a more simplified procedure will be used here. We shall merely examine the scatter diagrams for $X_2$ vs. $X_4$ and $X_2$ vs. $X_3$. These are presented in Figs. 11 and 12. From Fig. 11 it is clear that a considerable amount of correlation exists between $X_2$ and $X_4$. From Fig. 12 it would seem that while some degree of association may exist between $X_2$ and $X_3$, the correlation is certainly not very great. Therefore, $X_3$ will be chosen as the additional variable to be introduced into the estimating equation, which will be of the form:

(37)  $X_{1c} = \alpha + \beta_{12.3}X_2 + \beta_{13.2}X_3$

Fig. 11. Airframe weight versus combat radius

Fig. 12. Airframe weight versus maximum speed (kn)

The normal equations required to obtain estimates of the parameters $\alpha$, $\beta_{12.3}$, and $\beta_{13.2}$ are:*

(38)  $\Sigma X_1 = N\alpha + \beta_{12.3}\Sigma X_2 + \beta_{13.2}\Sigma X_3$

(39)  $\Sigma X_1X_2 = \alpha\Sigma X_2 + \beta_{12.3}\Sigma X_2^2 + \beta_{13.2}\Sigma X_2X_3$

(40)  $\Sigma X_1X_3 = \alpha\Sigma X_3 + \beta_{12.3}\Sigma X_2X_3 + \beta_{13.2}\Sigma X_3^2$

These equations may be solved simultaneously, using the same procedure outlined in Section V for the second degree regression case. However, the three normal equations may be reduced to two if the analysis is conducted in terms of deviations from the means of the variables. Denoting deviations by small x,

$x_i = X_i - \bar{X}_i$,

the normal equations become:

(41)  $\Sigma x_1x_2 = \beta_{12.3}\Sigma x_2^2 + \beta_{13.2}\Sigma x_2x_3$

(42)  $\Sigma x_1x_3 = \beta_{12.3}\Sigma x_2x_3 + \beta_{13.2}\Sigma x_3^2$

and $\alpha$ is estimated from

(43)  $\alpha = \bar{X}_1 - \beta_{12.3}\bar{X}_2 - \beta_{13.2}\bar{X}_3$

For our illustrative example, many of the required summations have already been calculated (see Table 2). These are:

*See Croxton and Cowden, op. cit., pp. 756-61.


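$\Sigma X_1 = 973$    ($\bar{X}_1 = 69.50$)

$\Sigma X_2 = 531$    ($\bar{X}_2 = 37.93$)

$\Sigma X_2^2 = 34{,}813$

$\Sigma X_1^2 = 132{,}589$

$\Sigma X_1X_2 = 65{,}941$

The additional summations needed are:*

$\Sigma X_3 = 9{,}630$    ($\bar{X}_3 = 687.86$)

$\Sigma X_1X_3 = 659{,}500$

$\Sigma X_2X_3 = 338{,}525$

$\Sigma X_3^2 = 7{,}543{,}900$

In order to use equations (41) and (42), the deviations from means must be derived for the summations contained in these equations. Using short-cut methods,** the necessary calculations are as follows:

*See Table 5.

**Derivation of the short-cut method is fairly straightforward. For $\Sigma x_2^2$, for example,

$\Sigma x_2^2 = \Sigma(X_2 - \bar{X}_2)^2$

$= \Sigma X_2^2 - 2\bar{X}_2\Sigma X_2 + N\bar{X}_2^2$

$= \Sigma X_2^2 - 2\bar{X}_2\Sigma X_2 + N(\Sigma X_2/N)\bar{X}_2$

$= \Sigma X_2^2 - 2\bar{X}_2\Sigma X_2 + \bar{X}_2\Sigma X_2$

$= \Sigma X_2^2 - \bar{X}_2\Sigma X_2$.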
Table 5

SUPPLEMENTARY DATA NEEDED FOR MULTIPLE REGRESSION ANALYSIS OF X1 VS. X2 AND X3

Observation       X3       X1 X3      X2 X3        X3^2

F-1              525       4,200      3,675     275,625
F-2              575       8,625      4,600     330,625
F-3              600      12,000      5,400     360,000
F-4              750      30,000     11,250     562,500
F-5              800      24,000      9,600     640,000
F-6            1,100      38,500     22,000   1,210,000
F-7            1,200      84,000     30,000   1,440,000
B-1              525      26,250     21,000     275,625
B-2              550     145,750     63,250     302,500
B-3            1,100     121,000     55,000   1,210,000
B-4              525      44,625     36,750     275,625
B-5              330      19,800     16,500     108,900
B-6              500      10,000     10,000     250,000
B-7              550      90,750     49,500     302,500

Total          9,630     659,500    338,525   7,543,900

SOURCE: Table 1


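$\Sigma x_2^2 = \Sigma X_2^2 - \bar{X}_2\Sigma X_2$

$= 34{,}813 - (37.93)(531)$

$= 34{,}813 - 20{,}141 = 14{,}672$

$\Sigma x_2x_3 = \Sigma X_2X_3 - \bar{X}_2\Sigma X_3$

$= 338{,}525 - (37.93)(9{,}630)$

$= 338{,}525 - 365{,}266 = -26{,}741$

$\Sigma x_1x_2 = \Sigma X_1X_2 - \bar{X}_2\Sigma X_1$

$= 65{,}941 - (37.93)(973)$

$= 65{,}941 - 36{,}906 = 29{,}035$

$\Sigma x_3^2 = \Sigma X_3^2 - \bar{X}_3\Sigma X_3$

$= 7{,}543{,}900 - (687.86)(9{,}630)$

$= 7{,}543{,}900 - 6{,}624{,}092 = 919{,}808$

$\Sigma x_1x_3 = \Sigma X_1X_3 - \bar{X}_3\Sigma X_1$

$= 659{,}500 - (687.86)(973)$

$= 659{,}500 - 669{,}288 = -9{,}788$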
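The short-cut calculations above are easy to mechanize. The sketch below uses a hypothetical helper, `dev_sum`, and the rounded means quoted in the text (37.93 and 687.86) so that the results line up with the hand calculation. As an aside not taken from the Memorandum, it also computes the simple correlation between $X_2$ and $X_3$ from the same deviation sums, which gives a numerical counterpart to the visual inspection of Fig. 12.

```python
def dev_sum(sum_of_products, mean_a, sum_b):
    """Short cut: sum of (A - mean A)(B - mean B) = sum(AB) - mean(A) * sum(B)."""
    return sum_of_products - mean_a * sum_b

# Raw sums from the text (N = 14 observations).
sum_x1, sum_x2, sum_x3 = 973, 531, 9_630
mean_x2, mean_x3 = 37.93, 687.86          # rounded means, as used in the text

s22 = dev_sum(34_813, mean_x2, sum_x2)      # sum of x2^2
s23 = dev_sum(338_525, mean_x2, sum_x3)     # sum of x2*x3
s12 = dev_sum(65_941, mean_x2, sum_x1)      # sum of x1*x2
s33 = dev_sum(7_543_900, mean_x3, sum_x3)   # sum of x3^2
s13 = dev_sum(659_500, mean_x3, sum_x1)     # sum of x1*x3

# Simple correlation between the two explanatory variables.
r23 = s23 / (s22 * s33) ** 0.5

print(round(s22), round(s23), round(s12), round(s33), round(s13))
print(f"r(X2, X3) = {r23:.2f}")
```

Using unrounded means would shift the deviation sums slightly, so small differences from the figures in the text are to be expected in that case.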

Substituting these values in equations (41) and (42), we have

(44)  $29{,}035 = 14{,}672\,\beta_{12.3} - 26{,}741\,\beta_{13.2}$

(45)  $-9{,}788 = -26{,}741\,\beta_{12.3} + 919{,}808\,\beta_{13.2}$

and solving equations (44) and (45) by use of the method described in Section III, the "least squares" estimates of the regression coefficients are:

$\beta_{12.3} = 2.069179$

$\beta_{13.2} = 0.049515$

The estimate of $\alpha$ is obtained from equation (43):

(46)  $\alpha = 69.50 - (2.069179)(37.93) - (0.049515)(687.86)$

$= 69.50 - 78.48 - 34.06$

$= -43.04$

Combining the estimates of the parameters $\alpha$, $\beta_{12.3}$, and $\beta_{13.2}$, the estimating equation is:

(47)  $X_{1c} = -43.04 + 2.0692X_2 + 0.049515X_3$

This equation represents a linear surface in three dimensional space. A graphic portrayal is presented in Fig. 13.
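The pair of equations (44) and (45) can also be solved directly. The sketch below uses Cramer's rule, which is a substitute chosen for brevity and not necessarily the elimination method described in Section III; the function name is hypothetical.

```python
def solve_two_variable(s22, s23, s33, s12, s13):
    """Solve  s12 = b2*s22 + b3*s23  and  s13 = b2*s23 + b3*s33  for b2, b3."""
    det = s22 * s33 - s23 ** 2
    b2 = (s12 * s33 - s13 * s23) / det
    b3 = (s13 * s22 - s12 * s23) / det
    return b2, b3

# Deviation sums from equations (44) and (45).
beta_12_3, beta_13_2 = solve_two_variable(
    s22=14_672, s23=-26_741, s33=919_808, s12=29_035, s13=-9_788)

# Equation (43): constant term from the sample means.
alpha = 69.50 - beta_12_3 * 37.93 - beta_13_2 * 687.86

print(f"beta_12.3 = {beta_12_3:.4f}, beta_13.2 = {beta_13_2:.6f}, alpha = {alpha:.2f}")
```

The results agree with the coefficients of equation (47) to the precision carried in the text.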

In order to compute the standard error of estimate and other statistical measures, the quantity "explained sum of squares" is needed. It is derived as follows:

$\Sigma X_{1c}^2 = \alpha\Sigma X_1 + \beta_{12.3}\Sigma X_1X_2 + \beta_{13.2}\Sigma X_1X_3$

$= (-43.04)(973) + (2.069)(65{,}941) + (0.04952)(659{,}500)$

$= -41{,}878 + 136{,}432 + 32{,}658$

$= 127{,}212$

Fig. 13. Multivariate regression surface
We may now calculate the standard error of estimate (adjusted):*

$S^2 = \dfrac{\Sigma X_1^2 - \Sigma X_{1c}^2}{N - m} = \dfrac{132{,}589 - 127{,}212}{14 - 3} = \dfrac{5{,}377}{11} = 488.82$

$S = 22.11$ (millions of dollars)

and the coefficient of variation:

$C = S/\bar{X}_1 = 22.11/69.50 = 0.318$

The coefficient of determination (the fraction of total variation "explained" by $X_2$ and $X_3$) is calculated as before:**

$R^2 = \dfrac{\Sigma X_{1c}^2 - \bar{X}_1\Sigma X_1}{\Sigma X_1^2 - \bar{X}_1\Sigma X_1} = \dfrac{127{,}212 - (69.50)(973)}{132{,}589 - (69.50)(973)}$

$= \dfrac{127{,}212 - 67{,}624}{132{,}589 - 67{,}624} = \dfrac{59{,}588}{64{,}965} = 0.9172$,

and correcting for degrees of freedom:

$\bar{R}^2 = \dfrac{R^2(N - 1) - (m - 1)}{N - m} = \dfrac{(0.9172)(14 - 1) - (3 - 1)}{14 - 3} = \dfrac{9.9236}{11} = 0.9021$

$\bar{R} = 0.9498$.

*The symbol m refers to the number of parameters in the regression equation, in this case 3.

**In multiple regression analyses, R is used to denote measures of correlation.
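The three summary measures for the multiple regression can be reproduced in a few lines. The variable names are illustrative only, and the inputs are the rounded totals given in the text.

```python
import math

N, m = 14, 3                     # observations and fitted parameters
total_raw = 132_589              # sum of X1^2
explained_raw = 127_212          # explained sum of squares from the text
correction = 69.50 * 973         # mean(X1) * sum(X1)

s = math.sqrt((total_raw - explained_raw) / (N - m))    # standard error of estimate
coeff_var = s / 69.50                                   # relative to the mean of X1
r2 = (explained_raw - correction) / (total_raw - correction)
r2_adj = (r2 * (N - 1) - (m - 1)) / (N - m)             # adjusted for degrees of freedom
r_adj = math.sqrt(r2_adj)

print(f"S = {s:.2f}, C = {coeff_var:.3f}, R^2 = {r2:.4f}, "
      f"adjusted R^2 = {r2_adj:.4f}, adjusted R = {r_adj:.4f}")
```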
As in the case of simple linear regression, a prediction interval may be computed for a value of $X_1$, say $\hat{X}_1$, derived from the estimating equation for specified values of the explanatory variables, say $\hat{X}_2$ and $\hat{X}_3$. However, in a multivariate analysis the procedure for determining the equation for a prediction interval is considerably more complicated than it is for a simple linear regression analysis.*

The first step is to calculate the values of the coefficients (the c's) in the following set of equations, which is a modification of "normal equations" (41) and (42):

(48)  $c_{22}\Sigma x_2^2 + c_{23}\Sigma x_2x_3 = 1$
      $c_{22}\Sigma x_2x_3 + c_{23}\Sigma x_3^2 = 0$

(49)  $c_{32}\Sigma x_2^2 + c_{33}\Sigma x_2x_3 = 0$
      $c_{32}\Sigma x_2x_3 + c_{33}\Sigma x_3^2 = 1$

      $(c_{23} = c_{32})$

The required summations have already been developed (see equations (44) and (45)). The values of these summations must be substituted into equations (48) and (49), and the complete set solved simultaneously for $c_{22}$, $c_{23} = c_{32}$, and $c_{33}$. A routine procedure for doing this is contained in Table 6.** The steps are explained in the table, resulting in the required estimates of the "c" coefficients in the lower right hand corner of the table. They are:

$c_{22} = 0.00007197053$

$c_{23} = c_{32} = 0.000002092353$

$c_{33} = 0.000001148013$

*For a more detailed discussion, see A. J. Duncan, Quality Control and Industrial Statistics, Richard D. Irwin, Inc., Chicago, 1952, pp. 527-38.

**Also, see A. J. Duncan, op. cit., p. 529.

Table 6

PROCEDURE FOR COMPUTING VALUES OF THE "c" COEFFICIENTS

These coefficients are the basic ingredients contained in the equation for computing prediction intervals for values of $X_1$ obtained from the estimating equation. In passing, it should also be pointed out that these same coefficients may be used to obtain estimates of the regression coefficients. Instead of obtaining the regression coefficients from equations (41) and (42), as we did previously, we may calculate them from the "c" coefficients as follows:

$\beta_{12.3} = c_{22}\Sigma x_1x_2 + c_{23}\Sigma x_1x_3$

$= (0.00007197053)(29{,}035) + (0.000002092353)(-9{,}788)$

$= 2.089664 - 0.020480$

$= 2.0692$

$\beta_{13.2} = c_{23}\Sigma x_1x_2 + c_{33}\Sigma x_1x_3$

$= (0.000002092353)(29{,}035) + (0.000001148013)(-9{,}788)$

$= 0.060751 - 0.011237$

$= 0.0495$
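Equations (48) and (49) amount to inverting the 2 x 2 matrix of deviation sums, so the "c" coefficients can be written in closed form. The sketch below does this directly instead of following the tabular routine of Table 6; the function name is hypothetical.

```python
def c_coefficients(s22, s23, s33):
    """Closed-form inverse of [[s22, s23], [s23, s33]]: returns c22, c23 (= c32), c33."""
    det = s22 * s33 - s23 ** 2
    return s33 / det, -s23 / det, s22 / det

# Deviation sums from equations (44) and (45).
c22, c23, c33 = c_coefficients(s22=14_672, s23=-26_741, s33=919_808)

# Regression coefficients recovered from the c's.
beta_12_3 = c22 * 29_035 + c23 * (-9_788)
beta_13_2 = c23 * 29_035 + c33 * (-9_788)

print(f"c22 = {c22:.11f}, c23 = {c23:.12f}, c33 = {c33:.12f}")
print(f"beta_12.3 = {beta_12_3:.4f}, beta_13.2 = {beta_13_2:.4f}")
```

Each regression coefficient pairs one row of the c matrix with the two cross-product sums: $c_{22}$ and $c_{23}$ for $\beta_{12.3}$, $c_{23}$ and $c_{33}$ for $\beta_{13.2}$.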
Returning now to the subject of prediction intervals, the equation for a 95 per cent prediction interval for an estimate of $X_1$ obtained from a multivariate estimating equation containing two explanatory variables is:*

(50)  $\hat{X}_1 \pm t_{0.05}\,S\sqrt{1/N + c_{22}x_2^2 + c_{33}x_3^2 + 2c_{23}x_2x_3 + 1}$

where

$\hat{X}_1$ = the estimate of $X_1$ obtained from the estimating equation for specified values of $X_2$ and $X_3$, say $\hat{X}_2$ and $\hat{X}_3$

$t_{0.05}$ = the value of Student's "t" distribution at the 0.05 point for N - m degrees of freedom

$S$ = the standard error of estimate

$N$ = sample size

$c_{ij}$ = the calculated values of $c_{22}$, $c_{23}$, and $c_{33}$ obtained from equations (48) and (49)

$x_2^2 = (\hat{X}_2 - \bar{X}_2)^2$

$x_3^2 = (\hat{X}_3 - \bar{X}_3)^2$

$x_2x_3 = (\hat{X}_2 - \bar{X}_2)(\hat{X}_3 - \bar{X}_3)$

In the case of our illustrative example, the prediction interval equation becomes:**

(51)  $\hat{X}_1 \pm (2.201)(22.11)\sqrt{1/14 + 0.00007197x_2^2 + 0.000001148x_3^2 + (2)(0.000002092)x_2x_3 + 1}$

*Duncan, op. cit., p. 531.

**For the value of $t_{0.05} = 2.201$, see Snedecor, op. cit., p. 65. The number 2.201 is found in the 0.05 column on the row for N - m = 14 - 3 = 11 degrees of freedom.

To illustrate the use of equation (51), assume that we want to establish a 95 per cent prediction interval for $\hat{X}_1$ derived from the estimating equation with $\hat{X}_2 = 70$ and $\hat{X}_3 = 500$. Substituting $\hat{X}_2 = 70$ and $\hat{X}_3 = 500$ into equation (47), we find the estimate of $X_1$ to be:

$\hat{X}_1 = -43.04 + (2.0692)(70) + (0.04952)(500)$

$= -43.04 + 144.84 + 24.76$

$= 127$.

We then compute the deviations from means:

$x_2^2 = (\hat{X}_2 - \bar{X}_2)^2 = (70 - 37.9)^2 = (32.1)^2 = 1{,}030$

$x_3^2 = (\hat{X}_3 - \bar{X}_3)^2 = (500 - 687.9)^2 = (-187.9)^2 = 35{,}306$

$x_2x_3 = (\hat{X}_2 - \bar{X}_2)(\hat{X}_3 - \bar{X}_3) = (70 - 37.9)(500 - 687.9) = (32.1)(-187.9) = -6{,}032$,

and substitute the results into equation (51), obtaining:

$127 \pm (2.201)(22.11)\sqrt{1/14 + (0.00007197)(1{,}030) + (0.000001148)(35{,}306) + (2)(0.000002092)(-6{,}032) + 1}$

$= 127 \pm 48.66\sqrt{0.071 + 0.074 + 0.041 - 0.025 + 1}$

$= 127 \pm 48.66\sqrt{1.161}$

$= 127 \pm (48.66)(1.077) = 127 \pm 52.41$

$= 179$ and $75$.

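As in the case of simple linear regression, the prediction interval becomes wider as $\hat{X}_2$ and $\hat{X}_3$ are selected farther away from the sample means $\bar{X}_2$ and $\bar{X}_3$. In the illustrative example, if we choose $\hat{X}_2 = \bar{X}_2$ and $\hat{X}_3 = \bar{X}_3$, the deviations $x_2$ and $x_3$ are zero and the estimating equation gives $\hat{X}_1 = \bar{X}_1 = 69.5$, so the prediction interval would be:

$69.5 \pm 48.66\sqrt{1/14 + 1} = 69.5 \pm 48.66\sqrt{1.071}$

$= 69.5 \pm (48.66)(1.035)$

$= 69.5 \pm 50$

$= 119.5$ and $19.5$.

In this case the prediction interval is at its narrowest width.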
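Equation (51) lends itself to a small function that returns the estimate and the half-width of the 95 per cent interval for any chosen pair of explanatory values. This is a sketch only: the function name and argument names are hypothetical, and the constants are the rounded values used in the text.

```python
import math

def prediction_interval(x2, x3, t=2.201, s=22.11, n=14,
                        c22=0.00007197, c23=0.000002092, c33=0.000001148,
                        mean_x2=37.93, mean_x3=687.86):
    """Return (estimate, half_width) for equation (51) at the given X2 and X3."""
    estimate = -43.04 + 2.0692 * x2 + 0.049515 * x3      # equation (47)
    d2, d3 = x2 - mean_x2, x3 - mean_x3                  # deviations from the means
    radicand = 1 / n + c22 * d2 ** 2 + c33 * d3 ** 2 + 2 * c23 * d2 * d3 + 1
    return estimate, t * s * math.sqrt(radicand)

est, half_width = prediction_interval(70, 500)
_, half_width_at_means = prediction_interval(37.93, 687.86)

print(f"estimate = {est:.0f}, half-width = {half_width:.1f}")
print(f"half-width at the sample means = {half_width_at_means:.1f}")
```

The half-width depends only on how far the chosen values lie from the sample means, which is why it is smallest when both deviations are zero.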


We may now summarize the results for the multivariate regression and compare them with the statistical measures obtained for the simple linear regression case:*

                                        Simple Linear    Multivariate
                                        Regression       Regression

Standard error of estimate              $25 (million)    $22 (million)
Coefficient of variation (S/mean X1)    0.36             0.32
Coefficient of determination            0.87             0.90
Coefficient of correlation              0.94             0.95

*All measures included here are adjusted for degrees of freedom.

From these data it would appear that the addition of $X_3$ into the estimating equation has improved the situation, but only slightly. As before, when the curvilinear regression was compared with the simple linear case, the real question is whether the improvement is really significant, or whether it may be attributable purely to sampling variation. Again, this requires a statistical test. The author performed such a test, and found that in this case the improvement is not significant. In other words, the net increment of explained variance associated with the introduction of $X_3$ (after allowance for the loss of an additional degree of freedom) is not sufficient to enable us to be reasonably confident that the improvement is not due to chance.
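The text does not spell out the test that was used. One common choice for this kind of question is the F-test for an added variable, sketched below under explicit assumptions: the explained variation of the simple linear fit is rebuilt from the deviation sums in equation (44), the 5 per cent critical value 4.84 for 1 and 11 degrees of freedom comes from standard F tables (it is the square of the t value 2.201 cited in the text), and all names are hypothetical.

```python
correction = 69.50 * 973                      # mean(X1) * sum(X1)

# Explained variation about the mean for X1 on X2 alone.
explained_linear = 29_035 ** 2 / 14_672

# Explained and residual variation for X1 on X2 and X3.
explained_multiple = 127_212 - correction
residual_multiple = 132_589 - 127_212

residual_variance = residual_multiple / (14 - 3)     # N - m degrees of freedom
f_stat = (explained_multiple - explained_linear) / residual_variance

F_CRIT_05 = 4.84   # tabulated F(1, 11) at the 0.05 level (assumed from tables)
significant = f_stat > F_CRIT_05
print(f"F = {f_stat:.2f}, significant at 5 per cent: {significant}")
```

Under these assumptions the ratio falls short of the tabulated value, which is in line with the finding reported in the text.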


This is often the case in multiple regression analyses involving very small samples. The loss of an additional degree of freedom tends to reduce the incremental improvement in explained variance, often to the point where the improvement is not significant from a statistical point of view. In our multivariate regression illustrative example, the results of the statistical test lead us to the conclusion that the estimating equation for $X_1$ as a linear function of $X_2$ and $X_3$ is statistically no better than the equation involving $X_1$ as a linear function of $X_2$ alone.*

*See Appendix D for student problem in multiple regression analysis.
Appendix A

DERIVATION OF THE NORMAL EQUATIONS FOR A LINEAR NORMAL REGRESSION

The problem is to find the values of $\alpha$ and $\beta$ which will minimize the expression:

(1)  $\phi = \Sigma[X_1 - (\alpha + \beta X_2)]^2$

Expanding this expression, we obtain:

(2)  $\phi = \Sigma X_1^2 - 2\alpha\Sigma X_1 - 2\beta\Sigma X_1X_2 + N\alpha^2 + 2\alpha\beta\Sigma X_2 + \beta^2\Sigma X_2^2$

Differentiating (2) partially with respect to $\alpha$ and $\beta$:

(3)  $\dfrac{\partial\phi}{\partial\alpha} = -2\Sigma X_1 + 2N\alpha + 2\beta\Sigma X_2$

(4)  $\dfrac{\partial\phi}{\partial\beta} = -2\Sigma X_1X_2 + 2\alpha\Sigma X_2 + 2\beta\Sigma X_2^2$

Since for (2) to be at a minimum the partial derivatives of $\phi$ with respect to $\alpha$ and $\beta$ must be zero, we set (3) and (4) equal to zero and obtain the so-called "normal" equations:

(5)  $-2\Sigma X_1 + 2N\alpha + 2\beta\Sigma X_2 = 0$
     $-2\Sigma X_1X_2 + 2\alpha\Sigma X_2 + 2\beta\Sigma X_2^2 = 0$,

or,

(6)  $\Sigma X_1 = N\alpha + \beta\Sigma X_2$
     $\Sigma X_1X_2 = \alpha\Sigma X_2 + \beta\Sigma X_2^2$.
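The two normal equations in (6) are linear in $\alpha$ and $\beta$ and can be solved in closed form. The sketch below does so for the sums of $X_1$ and $X_2$ quoted in Section VI, then substitutes the solution back into (6) as a check. The function name is hypothetical.

```python
def solve_normal_equations(n, sum_x1, sum_x2, sum_x2_sq, sum_x1_x2):
    """Solve  sum_x1 = n*alpha + beta*sum_x2  and
              sum_x1_x2 = alpha*sum_x2 + beta*sum_x2_sq  for alpha and beta."""
    beta = (n * sum_x1_x2 - sum_x1 * sum_x2) / (n * sum_x2_sq - sum_x2 ** 2)
    alpha = (sum_x1 - beta * sum_x2) / n
    return alpha, beta

# Raw sums for initial tooling cost (X1) and airframe weight (X2), N = 14.
alpha, beta = solve_normal_equations(14, 973, 531, 34_813, 65_941)

# Residuals of the two normal equations; both should be essentially zero.
check_first = 14 * alpha + beta * 531 - 973
check_second = alpha * 531 + beta * 34_813 - 65_941
print(f"alpha = {alpha:.3f}, beta = {beta:.4f}")
```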
Appendix B

STUDENT PROBLEM IN SIMPLE LINEAR REGRESSION ANALYSIS

Using the discussion contained in Section III as a guide, the students will conduct a simple linear regression analysis of initial tooling cost ($X_1$) vs. combat radius ($X_4$). The basic data for this exercise are included in Table 1 (page 4), and the scatter diagram of $X_1$ vs. $X_4$ is presented in Fig. 3 (page 7).

The students are required to develop the following:

(1) The estimating (regression) equation for $X_1$ as a linear function of $X_4$

(2) Standard error of estimate (adjusted)

(3) Coefficient of variation ($S/\bar{X}_1$)

(4) A 95% confidence band around the regression line (show on a chart)

(5) Coefficient of determination (adjusted)

(6) Coefficient of correlation (adjusted)

Question for the students: Do you think that the estimating equation for $X_1$ vs. $X_4$ is preferable to the one for $X_1$ vs. $X_2$ developed in Section III? Why?
Appendix C

STUDENT PROBLEM IN CURVILINEAR REGRESSION ANALYSIS

Using the discussion contained in Section V as a guide, the students will conduct a second degree regression analysis of initial tooling cost ($X_1$) vs. combat radius ($X_4$).

The students are required to develop the following:

(1) The second degree estimating (regression) equation for $X_1$ as a function of $X_4$ and $X_4^2$

(2) Standard error of estimate (adjusted)

(3) Coefficient of variation ($S/\bar{X}_1$)

(4) A scatter diagram containing a plot of the regression equation, along with a band indicating $\pm 1S$ around the estimating equation

(5) Index of determination (adjusted)

(6) Index of correlation (adjusted)

Question for the students: Do you think that the second degree estimating equation might be preferable to the simple linear case developed in the previous exercise? Why?
Appendix D

STUDENT PROBLEM IN MULTIPLE REGRESSION ANALYSIS

Using the discussion contained in Section VI as a guide, the students will conduct a multiple regression analysis of initial tooling cost ($X_1$) as a linear function of maximum speed ($X_3$) and combat radius ($X_4$). The basic data for this analysis are contained in Table 1 on page 4.

A scatter diagram of $X_3$ vs. $X_4$ is presented in Fig. 14 on the next page. From the figure it is clear that the correlation between $X_3$ and $X_4$ is not very high. These two variables may therefore be used as explanatory variates in a multiple regression analysis.

The students are required to develop the following:

(1) The estimating (regression) equation for $X_1$ as a linear function of $X_3$ and $X_4$

(2) Standard error of estimate (adjusted)

(3) Coefficient of variation ($S/\bar{X}_1$)

(4) The equation for deriving 95 per cent prediction intervals for values of $X_1$ obtained from the estimating equation

(5) Coefficient of determination (adjusted)

(6) Coefficient of correlation (adjusted)

Question  for  the  students:  Do  you  think  that  the  regression  of 
vs.  X^  and  might  be  preferable  to  (l)  that  for  vs.  X^;  (2) 
that  for  X^  vs.  Xg  and  X^?  Why? 


